Determining When Objects Die to Improve
Garbage Collection
A dissertation submitted by
Nathan P. Ricci
In partial fulfillment of the requirements
for the degree of Doctor of Philosophy in
Computer Science
TUFTS UNIVERSITY
May 2016
ADVISER: Samuel Z. Guyer
0.1 Abstract
Although garbage collection frees programmers from the burden of manually freeing objects that will no longer be used, and eliminates many common programming errors, it can come with a significant performance penalty. This penalty can become particularly great when memory is constrained.
This thesis introduces two pieces of work to address this problem. The first is
Elephant Tracks, a tool that uses novel techniques to measure memory properties of Java programs. Elephant Tracks places object death events more accurately than any existing tool, and is able to do so without modification to the underlying VM.
The second is the Deferred Collector. Built based on observations from the Elephant Tracks traces, the Deferred Collector reduces redundant work performed by the garbage collector. In this thesis, we show that the techniques used by the Deferred Collector can reduce garbage collector tracing workload in some programs.
0.2 Acknowledgements
I would like to thank the Tufts University Department of Computer Science for building the environment that makes this work possible. I am especially grateful for the patient mentoring of my advisor, Sam Guyer. I also wish to express gratitude to my committee members: Norman Ramsey, Alva Couch, Mark Hempstead, and Tony Printezis.
Additionally, I would like to thank my wife, Gizem, and my parents, Susan and David, without whose support I never would have made it here.
I would also be remiss if I did not thank the Tufts Department of Athletics, for providing barbells, without which I would have long ago gone mad.
Lastly, I must also thank Tufts University and the National Science Foundation for their financial support.
Contents
0.1 Abstract
0.2 Acknowledgements
1 Introduction
2 Garbage Collection Background
2.1 On Object Lifetime
2.2 Garbage Collector Overhead
2.2.1 Mark Sweep
2.2.2 Generational Mark Sweep
2.3 Weak References
2.4 The Java Virtual Machine Tool Interface
3 Elephant Tracks
3.1 Elephant Tracks Introduction
3.2 Trace time
3.3 Background and related work
3.3.1 Garbage collection tracing
3.3.2 Why a new trace generator?
3.3.3 Related work
3.4 Elephant Tracks Design
3.4.1 Kinds of trace records
3.4.2 Execution model
3.5 Implementation
3.5.1 Timestamping strategy
3.5.2 The instrumenter
3.5.3 The agent
3.5.4 Properties of our implementation approach
3.6 Results
3.6.1 Performance
3.6.2 Trace analysis
3.7 Conclusions
4 Deferred Collector
4.1 Finding Key Objects
4.2 Deferred Collector Design
4.2.1 Defer All Objects Reachable from the Key
4.2.2 Large Object Space
4.3 Mitigating Bad Hints
4.3.1 Application Programming Interface
4.4 JikesRVM
4.4.1 Sanity Checker
4.5 Experimental Results for the Deferred Collector
4.5.1 Methodology
4.5.2 Doubly Linked List
4.5.3 sunflow
4.6 Related Work
5 Conclusion
Bibliography
List of Figures
2.1 The events in the life of an object. First it is allocated, then it is used. Eventually, it is used for the last time. Sometime after this, it may become unreachable. Finally its memory will be reclaimed.
2.3 A weak reference object has a field called a "referent" that refers to some object (A). When the object referred to by the referent field is reachable only through weak reference objects, it may be collected. If this happens, the referent field is nulled (B).
3.1 Pseudocode for the Merlin algorithm.
4.1 The live size of each of the DaCapo benchmarks with respect to time. The blue line shows the number of objects live over time. The lower green dashed line shows the number of live objects that fall into clusters as described in Section 4.1.
4.2 Mark/cons ratio for the doubly linked list program, run with a 16 MB heap and differing values of immediate collection period.
4.3 Mark/cons ratio for sunflow, run with a 27 MB heap and differing values of immediate collection period.
4.4 Mark/cons ratio for sunflow, run with a 32 MB heap and differing values of immediate collection period.
Chapter 1
Introduction
Since the task of allocating and freeing memory in computer programs is tedious and error prone, automated memory management (known as garbage collection) has been a boon to computer programmers. However, automated memory management comes with costs: extra memory overhead, extra running time, or both.
The hypothesis of this thesis is that a greater understanding of the structure of a program’s heap can improve the performance of garbage collectors. Intuitively, this seems obvious; the data structures used in a program govern in large part how the memory of a program is used, and understanding those data structures should give us more information about the lifetime of objects.
However, in order to examine this intuitive idea, we will need to develop new techniques to observe how programs use memory. In particular, we want to know precisely when objects become unreachable and may be collected. Thus, this thesis has two parts. In the first half, we present Elephant Tracks, a tool which uses novel techniques to trace the execution of Java programs. The traces produced by Elephant Tracks contain a record of all object allocation and death events, and enough information to place these events in a specific calling context (a novel feature of Elephant Tracks). These traces can be used to analyze runtime properties of programs, including prototyping of new garbage collection algorithms. Since they place death events so precisely, it is easy to test schemes that require less precision by ignoring some of the information in the trace. A central claim of this thesis is that this can be done without relying on any modification to the underlying VM.
In the second half of this thesis, we present the deferred collector, built based on data gathered with Elephant Tracks. The deferred collector exploits the observation that much of the heap does not change between two successive runs of a tracing-style garbage collector, and as a result much work is repeated between such successive runs. The deferred collector reduces this repeated work based on programmer hints. We claim that this can reduce garbage collector tracing workload in some programs.
Chapter 2
Garbage Collection Background
In order to understand the performance characteristics of garbage collectors, it is necessary to understand how they work. Thus, we will describe their operation in this chapter. As there is copious work on garbage collection, this chapter will focus only on details relevant to this thesis.
2.1 On Object Lifetime
First, we must make a brief aside to explain the events of interest in an object's lifetime. We will borrow terminology first used by Röjemo and Runciman (1996). A program allocates an object, and sometime later it is first used; the period between these events is called the lag. Eventually, the program uses an object for the last time, and at this point we can say the object is dead. In principle, it would be safe for the garbage collector to reclaim an object's memory the very instant of its death.
However, in general it is not possible to determine whether a use is the last use.
So, instead, the first point at which the collector could collect an object is later, when it becomes unreachable. The period between last use and the time an object becomes unreachable is called use drag.
Figure 2.1: The events in the life of an object. First it is allocated, then it is used. Eventually, it is used for the last time. Sometime after this, it may become unreachable. Finally its memory will be reclaimed.
Finally, although the garbage collector could collect an object as soon as the object becomes unreachable, in practice the collector runs only intermittently; the collector will not collect the object until its next run. This engenders another period of drag, the reachability drag.
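The lifetime events above can be made concrete with a small computation. The sketch below uses invented method-time timestamps (not numbers from a real trace) to compute lag, use drag, and reachability drag from the four events defined in this section.

```python
# Lifetime events of a single object, as trace-time timestamps.
# The values are illustrative, not taken from a real trace.
allocated = 10      # the object is allocated
first_use = 25      # first use: the period [allocated, first_use) is the lag
last_use = 180      # last use: the object is now dead
unreachable = 210   # the object loses its last (transitive) path from the roots
reclaimed = 400     # a later GC run finally frees its memory

lag = first_use - allocated                   # 15
use_drag = unreachable - last_use             # 30: dead but still reachable
reachability_drag = reclaimed - unreachable   # 190: unreachable, not yet freed

print(lag, use_drag, reachability_drag)  # 15 30 190
```

The reachability drag is the quantity a collector can actually influence: running more often shrinks it, at the cost of more collections.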
2.2 Garbage Collector Overhead
How much overhead does garbage collection entail? While there have been many significant advances in garbage collection performance, overhead can still be very high when memory is constrained. To understand why, first consider how most typical garbage collection algorithms work. As the program requests new objects, memory for them is allocated on demand, until some threshold of allocated memory (H, for "heap size") is reached. Then, the program is paused, and the garbage collector begins executing to reclaim memory. It begins by tracing all objects referred to by the program roots (the roots are just pointers into the heap from other memory locations, such as the stack). The garbage collector marks these initial objects, then finds all objects reachable from this set and marks them, eventually marking all objects that are transitively reachable from the roots.
(a) Heap, memory usage below threshold. (b) Heap full. (c) Heap after collection. Key: blue objects are allocated and reachable; grey objects are allocated, but dead; white space is unallocated.
If the threshold H is larger than the total amount of memory the program allocates during its execution, H is never reached, the program never has to perform a garbage collection, and the cost of GC is zero. If instead H is a large fraction of the total amount of allocation, but less than the total, the collector runs only infrequently. Imagine the collector has not run in some time, and then performs a new collection. Since most objects eventually die (that is, they are not immortal), and the cost of collection is proportional to the number of live objects, the collection is inexpensive. Furthermore, the collection will recover a large amount of memory, which means that it will be a long time before the amount of allocated memory reaches H again. So, if H is large enough, collections are infrequent and inexpensive.
In contrast, if H is small, then collections will occur frequently: Each collection will recover only a small amount of memory, leaving only a small amount to be allocated before H is reached again and a new collection must occur. In this case, we are in the unhappy world of frequent and expensive collections.
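The relationship between H and collection cost can be sketched numerically. The model below is a deliberate simplification (constant live size, mark cost proportional to live objects): each collection reclaims everything except the live set, so the program can allocate H minus the live size before the next collection.

```python
def gc_cost(total_allocation, live_size, heap_size):
    """Rough model of tracing-GC work for a program that allocates
    total_allocation bytes with a constant live_size, under threshold
    heap_size (H). Returns (number of collections, total mark work)."""
    assert heap_size > live_size, "H must exceed the live size"
    room = heap_size - live_size          # bytes allocatable between collections
    collections = total_allocation // room
    mark_work = collections * live_size   # each collection marks the live set
    return collections, mark_work

# A large heap: collections are rare, total mark work is small.
print(gc_cost(1_000_000, 10_000, 500_000))   # (2, 20000)
# A small heap: frequent collections, far more total mark work.
print(gc_cost(1_000_000, 10_000, 20_000))    # (100, 1000000)
```

The asymmetry in the two printed results is exactly the happy/unhappy contrast described in the text: shrinking H from 50x to 2x the live size multiplies the total mark work fifty-fold in this toy model.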
A direct comparison to manual memory management is not straightforward. One could do as Hertz and Berger (2004) did and determine the exact unreachable time of every object in a Java program, replay the program exactly, inserting manual frees at the exact moment of death, and compare that to the standard garbage collection scheme. However, the presence or absence of manual memory management changes how programs are written, and this experiment does not necessarily reflect the way that manual memory management is actually done. In practice, auxiliary data structures may be used to track memory, defensive copies may be made, or other techniques employed. If the data structures used to track memory are elaborate enough, they could be seen as a partial implementation of a garbage collector.
However imperfect the analysis, Hertz and Berger (2004) concluded that when H is approximately four times the typical live size of the program, the costs of garbage collection are similar to the costs of manual memory management. When H is greater than this, garbage collection is more efficient than manual memory management. However, if we move H in the opposite direction, the costs of garbage collection increase dramatically. With H only twice the live size, the execution of a program is slowed dramatically (Hertz and Berger, 2004).
2.2.1 Mark Sweep
One of the most basic algorithms for garbage collection is mark-sweep. It proceeds as follows:
1. Once the threshold is reached, the mutator is paused.
2. The collector enumerates all roots (references on the stack, or in global variables, that refer to objects in the heap).
3. The mark phase begins: all objects reachable via the roots are marked, and all objects transitively reachable from those objects are marked.
4. The sweep phase begins: the collector linearly scans through memory, checking whether objects are marked. All objects not marked are reclaimed ("swept").
5. The mutator resumes.
The marking can be done by modifying a bit in a side table, or the mark bit can be stored in memory adjacent to the object in a special header area. Since the sweeper is read-only with respect to live objects, and the mutator will never interact with unreachable objects, it is possible for the mutator to resume before the sweeper is complete; the mutator and sweeper then run concurrently.
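The steps above can be sketched as a toy simulator. The heap representation here (a dictionary mapping object ids to their outgoing pointers) is purely illustrative; a real collector works over raw memory with mark bits in headers or side tables, as just described.

```python
def mark_sweep(heap, roots):
    """One mark-sweep collection over a toy heap.
    heap:  {obj_id: [obj_ids it points to]}
    roots: iterable of obj_ids directly referenced by stack/globals.
    Returns (marked survivors, swept objects); mutates heap in place."""
    marked = set()
    stack = [r for r in roots if r in heap]   # step 2: enumerate roots
    while stack:                              # step 3: mark phase
        obj = stack.pop()
        if obj in marked:
            continue
        marked.add(obj)
        stack.extend(heap[obj])               # follow outgoing pointers
    swept = set(heap) - marked                # step 4: sweep unmarked objects
    for obj in swept:
        del heap[obj]
    return marked, swept

heap = {1: [2], 2: [3], 3: [], 4: [5], 5: [4]}  # 4 and 5 form a dead cycle
survivors, swept = mark_sweep(heap, roots=[1])
print(sorted(survivors), sorted(swept))  # [1, 2, 3] [4, 5]
```

Note that the unreachable cycle (objects 4 and 5) is reclaimed: tracing collectors need no special handling for cycles, unlike reference counting.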
2.2.2 Generational Mark Sweep
One important advance in garbage collector technology exploits an empirical observation about object lifetimes known as the Generational Hypothesis: most objects die young.
In order to take advantage of this, new objects are allocated into a small space called the nursery. When the nursery is filled, survivors are copied to a larger space, called the mature space. This larger space is typically managed with the previously described mark sweep algorithm.
In order that the collector can collect the nursery without having to examine the mature space at all, it needs to keep track of any pointers in the mature space that refer to objects in the nursery; such pointers are treated as roots during a nursery collection. In a JVM that relies on a just-in-time (JIT) compiler, this tracking is accomplished by the JIT compiler injecting code into the mutator surrounding any write of a pointer; this code is called the write barrier (a write barrier could also be implemented in an interpreter, or an ahead-of-time compiler).
The write barrier code checks whether the written pointer refers from the mature space into the nursery, and if so the address of that pointer is stored in a list called the remembered set.
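The barrier logic can be sketched as follows. The address-range test for deciding which space a location belongs to is an assumption made for illustration; real JVMs compile an equivalent check inline around each pointer store.

```python
MATURE_START = 1000   # assumption: addresses >= 1000 lie in the mature space

remembered_set = []   # addresses of mature-space slots that point into the nursery

def write_barrier(slot_addr, target_addr):
    """Run on every pointer store. If a mature-space slot now refers to a
    nursery object, remember the slot so a nursery collection can treat it
    as a root without scanning the mature space."""
    if slot_addr >= MATURE_START and target_addr < MATURE_START:
        remembered_set.append(slot_addr)

write_barrier(1500, 42)    # mature -> nursery: recorded
write_barrier(1500, 2000)  # mature -> mature: ignored
write_barrier(10, 20)      # nursery -> nursery: ignored
print(remembered_set)      # [1500]
```

Only the first store is interesting to a nursery collection, which is why the other two fall through the barrier with no work beyond the comparison.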
Some generational collectors use an alternative called card marking instead of a simple list. In this scheme, one bit is associated with each "card" of memory (the size of a card varies, but typically cards fit evenly onto pages, several per page). These bits are stored together in a single bitmap called the card table. In the card marking scheme, instead of appending to a list, the write barrier sets the bit associated with the card in which a pointer is being mutated. Then, when a nursery garbage collection occurs, the garbage collector scans those cards that have been marked in the card table. After a collection, the card table is zeroed. Card marking has the advantage of bounded space, since the card table is of fixed size (and the list used in a remembered set is not). However, it has the disadvantage that the collector must scan an entire card even if only one pointer in it has been mutated.
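Card marking admits the same kind of sketch; the card size and table length below are arbitrary illustrative choices.

```python
CARD_SIZE = 512                 # bytes per card (illustrative)
card_table = [0] * 64           # one mark per card; fixed size regardless of writes

def card_barrier(slot_addr):
    """Card-marking write barrier: mark the card containing the mutated
    slot, rather than appending the slot to an unbounded list."""
    card_table[slot_addr // CARD_SIZE] = 1

def dirty_cards():
    """At nursery-collection time, return the cards that must be scanned."""
    return [i for i, bit in enumerate(card_table) if bit]

card_barrier(10)      # a slot in card 0
card_barrier(1300)    # a slot in card 2
card_barrier(1400)    # same card 2 again: the table does not grow
print(dirty_cards())  # [0, 2]
```

The third write shows the space/precision trade-off: repeated writes to one card cost no extra space, but the collector must later scan all of card 2 even though only two of its slots changed.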
2.3 Weak References
As with many other garbage-collected languages, Java provides weak references. In Java, these are implemented using a special container type, WeakReference. When a WeakReference contains an object, the object can be accessed via the weak reference's get method. Suppose an object is reachable only through WeakReference objects; such an object is said to be weakly reachable. When an object is determined to be weakly reachable, it may be collected. This is illustrated in fig. 2.3.
Figure 2.3: A weak reference object has a field called a "referent" that refers to some object (A). When the object referred to by the referent field is reachable only through weak reference objects, it may be collected. If this happens, the referent field is nulled (B).
Weak references are typically used to implement software caches, since it is not desired for the cache itself to keep objects resident in memory.
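Java's WeakReference behaves much like Python's weakref module, which we use here for a runnable sketch. The clearing of the referent is immediate in this example only because CPython reclaims the object as soon as its last strong reference is dropped; in Java, get would similarly return null, but only after the collector has run and determined the object to be weakly reachable.

```python
import weakref

class Node:
    """Stand-in for an arbitrary heap object."""
    pass

obj = Node()
ref = weakref.ref(obj)   # analogous to new WeakReference<>(obj) in Java
assert ref() is obj      # like WeakReference.get(): referent still strongly held

del obj                  # drop the last strong reference; the object is now
                         # only weakly reachable, so it may be reclaimed and
                         # the reference cleared (the "nulled referent")
print(ref())             # None
```

This is exactly the cache-friendly property the text describes: the weak reference alone never keeps its referent resident.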
2.4 The Java Virtual Machine Tool Interface
Java virtual machines provide a standard interface for debugging tools called the Java Virtual Machine Tool Interface (JVMTI) (Sun Microsystems, 2004). This interface allows the JVM's user to specify a shared library, called a JVMTI agent, to be loaded along with the VM. As classes are loaded by the JVM, their bytecodes can be intercepted by the agent. The agent can then modify them before they are executed by the JVM.
JVMTI is a relatively sparse interface, and it is intended that agents rely primarily on bytecode instrumentation to do most of their work. It does, however, provide some additional features beyond intercepting loaded classes. A JVMTI agent can traverse all the live objects in the heap, examine local variables on the stack, and intercept Java Native Interface calls.
In order to carry out the bytecode instrumentation used in the work presented in this thesis, we use version 3.3 of the ASM bytecode rewriting framework. ASM is widely used and actively maintained, and is part of the infrastructure of at least one major commercial JVM (Oracle's HotSpot).
Chapter 3
Elephant Tracks
3.1 Elephant Tracks Introduction
Garbage collection tracing tools have been instrumental in the development of new garbage collection algorithms. A GC tracing tool produces an accurate trace of all the dynamic program events that are relevant to memory management, including allocations, pointer updates, and object deaths. We can quickly test a new GC algorithm by building a simulator that reads the GC trace, instead of developing a full GC implementation in a real virtual machine, which is a considerable undertaking.
One of the widely used GC tracing tools for Java, GCTrace, is available as a component of the JikesRVM Java virtual machine (Alpern et al., 2005b). That tool, like ours, is based on the Merlin algorithm (Hertz et al., 2002, 2006), but suffers from several limitations. First, the implementation is integrated directly into the garbage collector. Due to the ongoing evolution of the JikesRVM Memory Management Toolkit, it no longer functions with recent versions of JikesRVM, and older versions will not run modern Java software. Second, GCTrace measures time only in terms of bytes allocated, a fine metric for GC simulation, but not useful for program analysis since it cannot readily be tied back to points in the program. Third, allocation time is not very precise for events other than allocation: many pointer updates and object deaths can occur at various points between two allocations. Finally, the existing tool does not support a number of features found in real programs, including weak references and multithreading.
In this chapter we present Elephant Tracks, a new GC tracing tool that is precise, informative, and can run on top of any standard JVM. Our goal is not simply to address the limitations of prior work, but to provide new capabilities that allow our tool to support a wider variety of program analysis and runtime systems research. The implementation of Elephant Tracks uses a combination of bytecode instrumentation and JVMTI (JVM Tool Interface) callbacks to construct a graph representing the connectivity of the heap, on which it runs the Merlin algorithm to compute idealized object death times. Its attributes include:
Precise: Elephant Tracks measures time in terms of method calls (i.e., the clock
ticks at every method entry and exit), which are much more frequent than
allocations.1
Complete: The implementation properly handles all relevant events, including difficult cases such as weak references, the Java Native Interface, sun.misc.Unsafe, and JVM start-up objects. Previous tools did not handle all of these features, which could result in inaccurate traces if a program makes use of them.
Informative: Traces include much more than just GC-related events. We emit a record for every method call and return, allowing us to tie memory behavior back to the program structure. In fact, we can reconstruct the complete dynamic calling context for any time step. We also record information about threads and exceptions, and, optionally, counts of heap reads and writes and the number of bytecodes executed. This ability is unique to Elephant Tracks.
1This includes constructor calls, thus tightly bounding most allocations as well.
Well-defined: We carefully define which aspects of the JVM execution model the traces capture, a definition that embodies a number of subtle design issues affecting the meaning of the traces. These include trace time, the definition of object lifetime, and the ordering of events in multi-threaded programs. In this thesis we explore these issues in detail and justify our choices.
Portable: Elephant Tracks is implemented as a JVMTI agent that runs on any compliant Java virtual machine. Running without modification to the underlying Java virtual machine is a research contribution of Elephant Tracks.
Fast: Elephant Tracks is as fast as existing tools, while providing much more information. Elephant Tracks includes performance tuning and optimizations to reduce overhead, which is critical for larger, long-running programs. Programs instrumented with Elephant Tracks run hundreds of times slower than uninstrumented programs, but that is comparableable to existing tools.
In the following sections we explore the design space of GC tracing tools, and explain the choices made for Elephant Tracks. We discuss the technical challenges of building such a tool using the JVMTI interface, which does not provide direct access to the JVM's representation of Java objects, or to the garbage collector. We also discuss the handling of weak references; this proved difficult because the JVM is able to side-step some of our instrumentation techniques in this case. Finally, we present some results, including overhead measurements, as well as new insights about the benchmarks gleaned from our precise traces.
3.2 Trace time
For Merlin-based tracing we need a notion of trace time, so that object death records, which are generated only at GC time, can be inserted in their proper place in the trace. The choice of trace time has a profound effect on the implementation of the tool and on the resulting traces.
Real time is a poor choice, since it is dependent on many factors, including the virtual machine, the operating system, and the hardware. In addition, tracing tends to slow programs down significantly, so the real times are likely to be significantly different from uninstrumented runs. Real time is also, in some sense, too precise: we do not want the trace to reflect the time it takes to actually perform a timestamp or record a trace record.
The solution is to express time in terms of some program-level event: each time the event is encountered we tick the trace clock. In this way, time depends only on a property of the program, not on the VM or underlying machine. This model breaks time into discrete steps, each of which represents a small region of program execution.
The choice of which event(s) to use for the clock affects the granularity of time, which ultimately determines the precision of the trace, since trace records labeled with the same time cannot be ordered or localized within the region covered by that time step. The trade-off is that more fine-grained notions of time are more difficult to implement correctly, since we need to place the instrumentation more precisely to make sure that every event is labeled with the correct time. They may also incur more overhead. In the following subsections, we will contrast two approaches: allocation time, and method time.
Allocation Time
GCTrace measures time in terms of the number of bytes allocated since the program started (called allocation time). At each allocation, time advances by the number of bytes allocated. Allocation time is good for basic GC research, since the traces are precise enough to drive simulations of experimental GC algorithms. Allocation time is fairly coarse, however, and a single time step can cover a large region of the code spanning multiple method calls. This notion of time was adequate for GCTrace, since its authors were primarily interested in simulating garbage collection, and none of the collectors they were interested in simulating had any reliance on particular methods.
Method Time
Elephant Tracks measures time in terms of the number of method entries and exits executed (which we call method time). To get a sense of the difference in precision, consider that across the DaCapo benchmarks there are, on average, 70 method entry/exit events between any two allocations (we present more measurements in Section 3.6). Method time is almost a strict superset of allocation time, since every allocation of a scalar object also calls at least one constructor. The exception is array allocations (in Java, arrays do not have constructors), but in our experience these are not frequent enough to change the results significantly. Also, if a constructor receives as an argument (not the receiver) a new object, there can be two allocations without an intervening constructor call. Again, this is not common. Elephant Tracks uses method time as it allows us to place events in a particular calling context.
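The two clocks can be contrasted on a toy event stream. The stream below is fabricated; it only illustrates that the method-time clock ticks far more often than the allocation-time clock, so an event can be localized to a much narrower region of execution.

```python
# A fabricated event stream: 'E' = method entry, 'X' = method exit,
# ('A', n) = allocation of n bytes.
events = ["E", "E", ("A", 16), "X", "E", "E", "X", ("A", 24), "X", "X"]

method_time = 0   # ticks at every method entry and exit
alloc_time = 0    # advances by bytes allocated, only at allocations
stamps = []       # (method_time, alloc_time) observed at each allocation

for ev in events:
    if ev in ("E", "X"):
        method_time += 1
    else:
        alloc_time += ev[1]
        stamps.append((method_time, alloc_time))

print(method_time, alloc_time)  # 8 40
print(stamps)                   # [(2, 16), (6, 40)]
```

Under allocation time, everything between the two allocations collapses into one time step (16 to 40); under method time, the same region spans four distinct ticks (2 through 6), each tied to a method boundary and hence to a calling context.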
3.3 Background and related work
In this section we describe the general garbage collection tracing problem and existing solutions, and motivate the need for a new trace generator.
3.3.1 Garbage collection tracing
A GC trace is a record of the sequence of memory management events that occur during a program's execution. The events of interest may vary depending on the intended use of the trace, but typically include object allocation, object reclamation, and mutations in the heap. Many of these events are straightforward to capture, such as object allocation, since they are explicitly invoked in the code.2 We can instrument those operations directly to emit a trace record with the relevant information.
The central challenge in GC tracing is determining object death times. An obvious solution is to emit an object death record when the garbage collector actually reclaims each object. This approach is easy to implement using JVMTI, but is unappealing for at least two reasons. First, the particular timing of these events is collector-specific: we would measure a property of the GC algorithm used during trace generation, rather than a fundamental property of the program. Second, the resulting information is very imprecise. Most garbage collectors run infrequently, reclaiming large numbers of objects long after they are no longer needed by the program. As a consequence, object deaths appear far removed from the program events that actually cause them. This makes the traces poorly suited to evaluating new GC algorithms, as the Merlin work showed (Hertz et al., 2006).
Idealized death times
Our goal is to generate traces with idealized death records. That is, each object death appears in the trace at the earliest time at which the object could be reclaimed. An idealized trace captures the behavior of a program independent of any particular GC algorithm, with object death events appearing close to the program actions that cause them. The exact nature of this problem depends on how we define "the earliest time". For example, we could compute death times based on object use (see fig. 2.1): an object is dead immediately after its last use. While interesting as a lower bound, this level of precision is potentially expensive to compute and cannot be exploited by any practical memory manager. Therefore, we adopt the definition used in garbage collection and in prior work on tracing: an object is dead when it is no longer reachable from the roots (local and global variables) directly or indirectly through any sequence of pointers. This choice still leaves many subtle issues, however, including the granularity of trace time and the liveness of root variables, which are discussed in more detail in Section 3.4.
2This is more subtle than you might think. In Java, the virtual machine may allocate application-visible objects as side-effects of other actions, such as class loading, and native libraries can also do so. Similar remarks apply to pointer updates.
A naive algorithm for computing idealized death times is to run the garbage collector much more frequently. For example, we could produce a very precise trace by invoking the collector at every program point where an object could become dead. Not surprisingly, this approach is impractical.
The Merlin algorithm
The Merlin algorithm, introduced by Hertz et al. (2002) and reproduced here in Figure 3.1, solves this problem by using timestamps to infer the idealized death times of objects when they are reclaimed at regularly scheduled garbage collections. During normal execution the algorithm timestamps objects whenever they lose an incoming pointer. At any point in time an object's timestamp represents the last time it was directly observed to be reachable. When the collector reclaims an object, however, its timestamp is not necessarily its death time. In many cases an object becomes unreachable indirectly, when an object that points to it becomes unreachable. In this case we need to determine which event occurred later: the direct loss of an incoming pointer (the timestamp), or the indirect loss of reachability (the death times of the referring objects). So, the idealized death time of an object
void ComputeObjectDeathTimes()
    Time lastTime;
    sort unreachable objects from the earliest timestamp to the latest;
    push each unreachable object onto a stack from the earliest timestamp to the latest;
    while (!stack.empty())
        Object obj = stack.pop();        // pop obj with next earlier timestamp
        Time objTime = obj.timeStamp;
        if (objTime <= lastTime)         // don't reprocess relabeled objects
            lastTime = objTime;
            for each (field in obj)
                if (isPointer(field) && obj.field != null)
                    Object target = getMemoryWord(obj.field);
                    Time targetTime = target.timeStamp;
                    if (isUnreachable(target) && targetTime < lastTime)
                        target.timeStamp = lastTime;
                        stack.push(target);

Figure 3.1: Pseudocode for the Merlin algorithm.
(Td(o)) is computed from its timestamp (Ts(o)) and the death times of any objects that point to it:

Td(o) = max(Ts(o), max{Td(p) : p references o})
This insight leads to a practical approach for GC tracing that is also central to the system we present in this chapter:
• During normal execution:
– Record ordinary events in the trace as they occur (e.g., object allocations
and pointer updates).
– Timestamp objects whenever they might become directly unreachable
(i.e., when they lose an incoming pointer).
• At GC time:
– Compute idealized death times using the formula above (implemented roughly as a depth-first search on the graph of dead objects, pushing computed death times across the pointers).
– Generate a death event record for each reclaimed object and insert it in
the proper place in the trace.
– Flush records to disk, and continue ...
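The death-time computation above can be sketched directly from the formula Td(o) = max(Ts(o), ...). The graph and timestamps below are invented; the sketch also uses a fixpoint iteration rather than the stack-based depth-first search of Figure 3.1, which computes the same result (and, like Merlin, handles cycles of garbage).

```python
def merlin_death_times(graph, stamps):
    """graph:  {obj: [objects it points to]}, restricted to the
               unreachable objects found at a collection.
       stamps: Ts(o), the last time each object was directly observed
               to be reachable (i.e., when it last lost a pointer).
       Returns idealized death times Td(o) = max(Ts(o), max Td(p) for p -> o)."""
    death = dict(stamps)
    changed = True
    while changed:                            # iterate to a fixpoint
        changed = False
        for src, targets in graph.items():
            for tgt in targets:
                if death[src] > death[tgt]:   # src kept tgt reachable until src died
                    death[tgt] = death[src]
                    changed = True
    return death

# a -> b -> c: object c lost its only direct pointer early (stamp 3), but it
# remained transitively reachable through a until a itself died at time 7.
graph = {"a": ["b"], "b": ["c"], "c": []}
stamps = {"a": 7, "b": 5, "c": 3}
print(merlin_death_times(graph, stamps))  # {'a': 7, 'b': 7, 'c': 7}
```

The example shows why a timestamp alone is not a death time: c's stamp says 3, but the pointer chain from a means it could not have been reclaimed before time 7.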
An important implication of the Merlin algorithm is that it requires a notion of trace time for use in the timestamps. In fact, all events in the trace must be associated with timestamps because we cannot learn about a death at the time it occurs; we discover the true death times of objects only at GC time, which is typically much later. The model of trace time (in particular, its granularity) has a profound impact on the implementation of the trace generator and the precision of the traces it generates.
3.3.2 Why a new trace generator?
The first realization of Merlin took the form of a customized garbage collector called GCTrace, implemented in JikesRVM. The main advantage of this approach is that the implementation can be integrated directly into the virtual machine code.
The compiler can be modified to add instrumentation to object allocations and pointer updates, and the garbage collector can be modified to perform the extra death time computation. GCTrace is a valuable tool. Its primary goal was to provide a trace that allowed garbage collections to be simulated with great accuracy. We elaborate on some of its properties, and discuss where and why we want to improve on them, below.
Precise: GCTrace uses allocation time for its traces: the trace time clock “ticks”
at each object allocation. This is a good choice for its goals, as allocation is
what causes the garbage collector to eventually do work. However, it does
mean that object deaths and other events that occur in between allocations
cannot be ordered or precisely localized at any finer granularity.
A related problem is that allocation time does not correspond to anything
static in the program itself, so figuring out where events occur relative to the
code is very difficult (e.g., “In which method did the death of object 739229
occur?”). If a collector wants to take advantage of this information, for exam-
ple by doing something different in different phases of program execution (as
demarcated by methods), it would be difficult to prototype using GCTrace.
Completeness: GCTrace works with many Java programs, but since it was
created, Java programs have grown more sophisticated. Multithreading, use
of the Java Native Interface, and weak references are now common; these were
not handled by GCTrace.
Integrated into VM: GCTrace is implemented as a garbage collector within the
VM. This allows it to exploit internal structures of the VM for performance.
However, it does leave it exposed to changes in those structures. MMTk (the
memory management toolkit used in JikesRVM) has undergone a number of
radical refactorings, often leaving the GCTrace implementation out of date.
A major goal of our system therefore is to create a useful memory tracing tool
without modifying the JVM internally.
The calling context is potentially useful for a variety of optimizations. For
example Xu (2013) presents an optimization to re-use the memory of objects
if they die before the next time their allocation site is invoked. Although
Elephant Tracks is not fast enough to use on-line, it would allow researchers
to prototype such ideas by analyzing traces before investing time in complex
modifications to a JVM.
3.3.3 Related work
The work most closely related to ours is the original GCTrace implementation of the
Merlin algorithm (Hertz et al., 2006), which is discussed in detail throughout this chapter. Foucar reimplemented GCTrace using a shadow heap (see section 3.5.3) implemented in C++, like Elephant Tracks, but otherwise preserving the execution model and dependence on JikesRVM (Foucar, 2006).
Two prior papers explore the relationship between liveness and reachability for garbage collection. Agesen et al. (Agesen et al., 1998) examine the effects of applying different levels of liveness analysis to the GC root set (variables on the stack).
They found that on average the differences were small, but on occasion static liveness analysis would improve collection efficiency noticeably. This result suggests that our dynamic liveness model is reasonable for most purposes, but could be improved (see later discussion). Hirzel et al. (Hirzel et al., 2002) additionally consider the difference between reachability and last-use liveness of objects (see the discussion of object lifetime in Section 2.1). They also find that schemes based on the liveness of variables (i.e., employing a compiler analysis to determine whether variables on the stack are no longer live, and therefore are not part of the GC's root set) have little impact on when a reachability-based garbage collector could collect objects. However, they do find that an object's last-use time and the time it becomes unreachable are often significantly different. Elephant Tracks currently cannot compute equivalent information.
GC traces have been used to drive empirical studies of heap behavior, especially those examining the distribution and predictability of the lifetimes of objects (Inoue et al., 2006; Jones and Ryder, 2008). At a coarse level, measuring time in bytes allocated and measuring time with method events do not produce dramatically different lifetime distributions. For analyses that are sensitive to program structure, however, this may be inadequate: many methods may be executed between any two allocations, and so the trace does not record accurately what event occurred in what method. In addition, allocation time is not stable across runs of a program under different inputs.
Jones and Ryder (Jones and Ryder, 2008) show that the calling context of object allocation correlates well with lifetime, i.e., objects allocated in the same calling context tend to live the same amount of time. They could not determine, however, whether the calling context of object death correlates with lifetime, which might be a more useful fact for further improving garbage collection.
Inoue et al. (Inoue et al., 2006) look at what information is needed to precisely predict the lifetime of an object at the time of its allocation. They define a fully precise predictor as one that predicts lifetime to within a single quantum of time. By using bytes-of-allocation as their unit of time, they significantly reduce the coverage and accuracy of their predictors. The lifetime of an object in bytes-of-allocation time is much less stable than the calling context of its death, since the latter is directly related to its cause of death, while the former includes many irrelevant events (i.e., unrelated allocations).
Compile-time GC (Guyer et al., 2006) and connectivity-based garbage collection (Hirzel et al., 2003) are two examples of techniques where knowing the program location at which an object dies is crucial. Such techniques are often evaluated using trace-driven simulation before embarking on a full implementation. Using
Elephant Tracks traces would yield a more accurate assessment of their potential.
Lambert et al. present a system for performing platform-independent JVM timing (Lambert and Power, 2008). Although similar in spirit to our JVM-independent execution model, the focus of this work is on developing a model of code execution, rather than heap memory behavior.
Uhlig and Mudge (Uhlig and Mudge, 1997) present a survey of memory tracing techniques. While their focus is on tracing memory accesses for architecture and system research, they enumerate a set of features they consider desirable for tracing
systems in general: completeness (all relevant events are recorded), detail (events are associated with program-level information), low distortion (tracing does not change the program's behavior), and portability. Elephant Tracks achieves many of these goals, although it significantly distorts actual running time, which is why we use a separate notion of time.
3.4 Elephant Tracks Design
Our goals in designing a new trace generator are to address the limitations of prior systems and to add new functionality to support new kinds of program analysis and memory management research. The central features of this design are (1) the kinds of program events recorded in a trace, and (2) the accuracy of this information with respect to some model of program execution. In this section we present the design of Elephant Tracks, and we discuss our choices in the context of the general GC tracing design space. In Section 3.5 we describe how this design is implemented.
3.4.1 Kinds of trace records
A minimal GC trace consists of just a sequence of object allocations and object deaths, labeled with the trace time and thread ID of each event. Without more information, though, such a trace has limited utility. In practice we add trace records for other kinds of relevant events to provide context for program analysis and to enable more kinds of trace-based simulations. For garbage collection research, for example, it is useful to add trace records for pointer updates in the heap, allowing a simulator to maintain an accurate heap model. Elephant Tracks can be configured to produce different kinds of trace records. We currently support the following kinds of records:
• Object allocations and object deaths (with idealized death times computed
using the Merlin algorithm).
• Pointer updates in the heap: These records include the source and target
objects, as well as the object field or array index being updated. We also include
updates of static fields.
• Method entry and exit: These records allow trace times to be mapped to
specific methods, and even more precisely, to specific calling contexts.
• Exceptions: We augment method entry and exit to indicate when an exception
is thrown, the sequence of method calls (if any) that are terminated early
because of the exception, and the entry to a handler for the exception. The
main purpose of these events is to provide accurate information about method
execution context.
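As a concrete (though invented) picture of these record kinds, the following sketch models one in-memory record per event. The record tags and field layout here are illustrative only; the actual Elephant Tracks on-disk trace syntax differs.

```java
// A toy trace-record model mirroring the event kinds listed above.
// Kind names and the args layout are invented for illustration.
enum Kind { ALLOC, DEATH, POINTER_UPDATE, METHOD_ENTRY, METHOD_EXIT, EXCEPTION }

final class TraceRecord {
    final Kind kind;
    final long time;      // trace time (e.g., method-entry/exit granularity)
    final int threadId;   // thread in which the event occurred
    final long[] args;    // e.g., object id, field id, target object id

    TraceRecord(Kind kind, long time, int threadId, long... args) {
        this.kind = kind; this.time = time;
        this.threadId = threadId; this.args = args;
    }

    @Override public String toString() {
        return kind + " t=" + time + " thr=" + threadId + " "
                + java.util.Arrays.toString(args);
    }
}
```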
Separately from the trace, Elephant Tracks also emits information about each class loaded, each field declared in the class, each method declared in the class, and each allocation site in each method. This information is referred to by the trace, e.g., the trace will mention a unique allocation site number, which can be found in the side description file.
We currently do not generate trace records for object timestamps or for general memory accesses (including stack reads and writes). This information would enable an even wider range of applications, such as cache simulations. These events are extremely frequent, however, and would result in overwhelmingly large traces. In addition, instrumenting every single variable access would be technically challenging; bytecode rewriting might not be the best approach for this level of detail.
3.4.2 Execution model
Ideally, we would like to generate perfect traces, in which every event is recorded with a perfectly accurate and precise time. But this goal raises a critical question: accurate with respect to what? That is, what aspects of the execution model do we want the trace to represent? Elephant Tracks, like other trace generators, relies on a host virtual machine to execute the target program. It runs alongside the
VM, recording relevant events. The problem is that the timing of some events is highly VM dependent; directly recording these events as they occur produces a trace reflective of arbitrary VM implementation choices. Instead, we want to generate traces that abstract away some details of the VM's execution model, and record events in a well-defined, less VM-specific way. In particular, we would like a trace that places each object's death at the first time any collector could collect it
(the point where the object becomes unreachable; see Figure 2.1). The possible models range widely, with some elements closer to the VM (essentially profiling the VM), and other elements more abstract, capturing an idealized execution of the program.
The main aspects of the execution model we wish to capture are (1) the definition of object lifetime (in particular, when objects are considered dead), and (2) the definition of trace time (i.e., when does the trace time clock “tick” and with what frequency). The overall goal of Elephant Tracks is to define these components in such a way that events can be localized precisely within the structure of the code.
The model is idealized for object lifetimes, but resorts to VM timing in cases where an idealized model is not possible, such as the interleaving of concurrent threads and the clearing of weak references.
One potential approach is to use non-deferred reference counting, which reclaims objects as soon as their reference counts become zero. Like reference counting collection, however, this approach cannot directly detect the death of cycles of objects, and would require frequent tracing collections to achieve high precision.
Therefore, we do not use this approach.
Defining object lifetime
Object lifetimes are delineated by allocation and death events. Most object alloca-
tions are explicit in the program, so simply recording them as they occur produces
a VM-independent trace. We have found, however, that there are several other
sources of allocations, including VM internal allocations (e.g., String constants in
class files and Class objects themselves), objects allocated by the VM before it can
even turn instrumentation on, and objects allocated by JNI calls. We capture all of
these, but cannot associate them with a usual allocation site in the application, and
for those allocated very early in the run, we cannot provide relative time or context
of allocation.
For object deaths, however, an explicit goal of GC tracing is to compute ide-
alized death times. Both Elephant Tracks and GCTrace adopt the standard GC
definition: an object is dead when it is no longer reachable from the roots (local and
global variables). Even within this seemingly narrow definition, however, there are
a range of possible models. To see why, consider the program events that can cause
an object to become unreachable:
• The program overwrites a pointer in the heap (putfield, etc.)
• The program overwrites a static (global) reference (putstatic)
• A local reference variable goes out of scope
• The program changes the value of a local reference variable
• A weak reference is cleared by the garbage collector
While the first two (heap and global writes) are straightforward to instrument, local variables and weak references are more difficult to pin down. Furthermore, there
are roots inside the VM that we cannot observe and that the VM does not neces-
sarily inform us about when they change. Fortunately these are mostly “immortal”
references, such as to class objects, or relate to constants constructed from class
files (these may come and go).
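Each of the events in the list above can be seen in a few lines of Java. The classes and names below are invented for illustration; a comment marks which event each statement corresponds to.

```java
import java.lang.ref.WeakReference;

// One statement per way an object can lose a reference, matching the
// event list above. Node and the field names are illustrative only.
class Node { Node next; }

class UnreachabilityEvents {
    static Node global;                 // a static (global) reference

    static void demo() {
        Node a = new Node();
        a.next = new Node();
        a.next = null;                  // (1) heap pointer overwritten (putfield)
        global = a;
        global = null;                  // (2) static reference overwritten (putstatic)
        Node b = new Node();            // (3) b's scope ends at method exit
        Node c = new Node();
        c = new Node();                 // (4) local reference variable reassigned
        WeakReference<Node> w = new WeakReference<>(new Node());
        // (5) w's referent may be cleared by the GC once only weakly reachable
    }
}
```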
Local variables
Tracing local variables presents many design choices and challenges. The key
question is: at what point is a local pointer variable dead, and therefore no longer
keeping the target object alive? At one end of the spectrum we could consider local
variables live throughout the method invocation with which they are associated. In
practice, however, most virtual machines apply some form of static liveness analysis
to compute more precise lifetimes. The virtual machine uses this information to
construct GC maps, which tell the garbage collector which variables to consider as
GC roots at a given point in the method.
GCTrace uses the GC maps in JikesRVM to determine which variables are live.
The advantage of this approach is that it is straightforward to implement. The down-
side is that the timing of the object death records depends on the specific liveness
analysis algorithm and choice of GC points made in JikesRVM.
Elephant Tracks currently uses a form of dynamic liveness to determine the life-
times of local variables. This choice reflects implementation decisions (described
in more detail below). A variable is considered dead after its last dynamic use. We
define a use as one of the following: (1) a direct dereference (access to an object or
array), (2) a type test, such as instanceof, (3) obtaining an array’s length, (4) use as a receiver of a dynamic method dispatch, or (5) a reference test, such as ifnull.
Dynamic liveness, however, is more precise than static liveness analysis, pri-
marily because it is not conservative about liveness on different execution paths.
The resulting traces show some object death times earlier than any real garbage
collector could achieve. For example, a reference that is held while a series of
methods are invoked, but never used or passed to any method, is considered dead in all the methods. We consider a variable live if it is passed to a method call as a parameter, or returned, even if it is never used within those methods, or the return value is not used.
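A minimal sketch of the difference (names invented): under dynamic liveness, the array below is considered dead immediately after its last use, even though the variable remains in scope across the following call.

```java
// Under ET's dynamic-liveness model, 'big' is dead right after its last
// use below, even though it stays on the stack during longComputation().
// A collector using conservative static GC maps might keep it live longer.
class LivenessDemo {
    static int lastUse(int n) {
        int[] big = new int[n];
        big[0] = 42;
        int x = big[0];            // last dynamic use: 'big' is dead after this
        longComputation();         // 'big' still in scope, but never used again
        return x;
    }

    static void longComputation() { /* allocates, runs for a while */ }
}
```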
Weak references
Weak references are a special class in Java used to manage the collection of objects. An object reachable only via weak references can be collected at any time.
Listing 3.1: Weak reference usage
WeakReference<Object> ref = new WeakReference<Object>(foo);
For example, the code in Listing 3.1 creates a new weak reference, ref, to the object referred to by variable foo. Internal to the WeakReference class is a field called referent. The referent object may also have normal (strong) references to it, and while it remains strongly reachable it will not be collected. However, if the program eliminates these strong references, and the object becomes reachable only via weak references, it may be collected. In this case, the referent field will be nulled out. Java also supports
Soft and Phantom references, which operate in a similar way, but have different semantics for when the referent field may be null.
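The behavior described above can be demonstrated with a short, self-contained example (java.lang.ref is standard library; the clearing step is deliberately not asserted, since clearing is at the collector's discretion):

```java
import java.lang.ref.WeakReference;

// While 'foo' is strongly reachable, the weak reference still yields it.
// Once the strong reference is dropped, the collector MAY clear the
// referent; System.gc() is only a hint, so the outcome is not guaranteed.
class WeakRefDemo {
    public static void main(String[] args) {
        Object foo = new Object();
        WeakReference<Object> ref = new WeakReference<>(foo);
        System.out.println(ref.get() == foo);   // true: still strongly reachable
        foo = null;                             // drop the strong reference
        System.gc();                            // clearing is not guaranteed
        System.out.println("after gc, cleared? " + (ref.get() == null));
    }
}
```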
Weak references and their cousins present an interesting challenge. In principle, the garbage collector can choose to clear weak references at any time (or not at all) once an object is no longer strongly reachable. In practice, they will only be cleared when the collector is actually run. Further, soft references are cleared “at the discretion” of the collector, in response to memory pressure. Phantom references are similarly affected by the timing of collector runs by the host VM. For a trace, though, this leaves no obvious idealized model of when to clear a weak reference.
Both Elephant Tracks and GCTrace opt to record these events when the VM chooses to perform them. Given that programs can perceive and respond to the collector’s
decision, there is no good alternative to this approach.
Concurrency
Most modern software uses concurrency in some form, which raises the question of
how to order trace events that occur in different threads. We adopt a straightforward
approach in which time is global, but trace records include both the time3 and the
ID of the thread in which the event occurred. In the current implementation, how-
ever, timestamps on objects do not include the thread ID, so object deaths cannot
necessarily be assigned to particular threads.
One problem with this approach is that the resulting traces encode the schedul-
ing decisions of the VM and operating system. Furthermore, trace instrumentation
perturbs program execution significantly, resulting in schedules that could be quite
different from the uninstrumented programs. While interesting, this problem is
difficult to address without controlling the scheduler directly—for example, by re-
playing a schedule from a real run. One potential solution is to represent time using
vector clocks, which would encode only the necessary timing dependences between
threads. However, this would still be sensitive to the particular order in which threads happen to interact.
We hope to investigate alternative designs in the future.
3.5 Implementation
Elephant Tracks is implemented as a Java agent that uses the Java Virtual Machine Tool Interface (JVMTI). The primary components of a system using ET are: the JVM itself, including its JVMTI and JNI support; the application; the Elephant Tracks agent; the ElephantTracks Java class file, which connects bytecode instrumentation to the agent via Java Native Interface (JNI) calls; and the instrumenter,
3We do not actually output the time value, but it can be derived by knowing which events “tick” the clock.
which rewrites the bytecode of classes as they are loaded.
3.5.1 Timestamping strategy
For Merlin to produce precise death times, the timestamp on an object must always be the time at which the object last lost an incoming reference. This invariant is easy to maintain for heap and static references, since we can directly instrument these operations, timestamping the old target before allowing the store to proceed.
For stack references, however, there is no explicit operation denoting the end of a variable’s scope. There are essentially two strategies for solving this problem:
(1) timestamp all live variables at every time step, or (2) timestamp each variable exactly when its lifetime ends. (Recall that we define a variable as being live only up to its last actual use.)
GCTrace uses strategy (1), which has the advantage that it is straightforward to implement: at each tick of the clock, walk the stack and timestamp each object referred to by a live variable. This strategy, however, creates a trade-off between performance and precision. Walking the stack is an extremely expensive operation, so it cannot be performed frequently, limiting the granularity of the clock. The problem is particularly acute when using allocation time, since a single time step can span multiple methods, requiring a full walk of the call stack at every tick. We believe that stack walking also inhibits code optimizations (or forces de-optimization), further slowing execution. Furthermore, as mentioned in Section 3.4 it relies on the
VM’s GC Maps to define variable liveness.
Elephant Tracks uses strategy (2). This approach requires more instrumentation, to timestamp a variable's referent whenever the variable is used. It has several advantages, though. The most important is that it works correctly for any granularity of time. In addition, it gives the trace generator explicit control over the model of variable liveness. Finally, it is amenable to an instrumentation-time optimization
(described below) that eliminates redundant timestamping operations.
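A toy model of strategy (2), with invented names: the instrumenter inserts a call to use() at every use of a local reference, so a shadow timestamp table always records the last time each object was known reachable from the stack.

```java
import java.util.IdentityHashMap;
import java.util.Map;

// Sketch of timestamp-on-use: rather than walking the stack at every
// tick, the instrumented program stamps an object each time a local
// variable referring to it is used. Merlin then reads the final stamp
// as the last moment the object was kept alive from the stack.
class UseTimestamps {
    static long clock = 0;                                  // trace-time clock
    static final Map<Object, Long> stamp = new IdentityHashMap<>();

    static void tick() { clock++; }                         // e.g., method entry/exit

    // Inserted by the instrumenter before every use of a local reference.
    static void use(Object o) { if (o != null) stamp.put(o, clock); }

    static long lastUseTime(Object o) { return stamp.getOrDefault(o, -1L); }
}
```

Because the stamp is updated at each use, the table stays correct no matter how fine the clock's granularity is, which is the key advantage claimed above.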
3.5.2 The instrumenter
The instrumenter is ordinary Java code and is written using the ASM bytecode rewriting tool (Bruneton et al., 2002). The current version of ET is written to use
ASM 3.3.1. ASM should work on any Java 1.6 class file, and will produce Java 1.6 class files whose meaning is well defined on any standards-compliant JVM. Therefore, we are not as worried about introducing a reliance on ASM as we are about relying on JVM internals, which could change at any time (as long as they still implement the standard). In order to avoid recursive self-instrumentation between instrumenter code and the application, we run the instrumenter in a separate operating system process, connected with the agent via pipes in both directions. The agent uses the JVMTI ClassFileLoadHook callback, which causes the JVM to present to the agent each class that the JVM wants to load, and gives the agent the opportunity to substitute other bytecode for what the JVM presents. The ET agent sends the bytecode to the instrumenter, which sends back an instrumented class file.
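For readers more familiar with pure-Java agents: the JVMTI ClassFileLoadHook plays the same role that java.lang.instrument's ClassFileTransformer plays in a Java agent. The sketch below is an analogy only; ET receives the class bytes in a native agent and pipes them to the external instrumenter process.

```java
import java.lang.instrument.ClassFileTransformer;
import java.security.ProtectionDomain;

// Pure-Java analogue of the JVMTI ClassFileLoadHook used by ET: the JVM
// offers each class's bytecode to the transformer, which may return a
// rewritten version, or null to keep the original bytes unchanged.
class ForwardingTransformer implements ClassFileTransformer {
    @Override
    public byte[] transform(ClassLoader loader, String className,
                            Class<?> classBeingRedefined,
                            ProtectionDomain protectionDomain,
                            byte[] classfileBuffer) {
        // A real implementation would send classfileBuffer to the
        // instrumenter process and return the instrumented bytes.
        return null;  // null means "leave this class unchanged"
    }
}
```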
The instrumenter assigns a unique number to each class, each field, each method, and each allocation site (for both scalars and arrays) in each method, writing them to what we call the names file. The instrumenter also sends the class and field information to the agent. (At present the agent has no need to maintain tables for the other information, so it is not sent.)
Ordinary instrumentation
We defer some special cases to Section 3.5.2, and describe now the usual instrumentation added by the ET instrumenter. We organize the description by feature.
Method entry and exit: On entry, and just before a return, we insert a call noting
the id of the method and the receiver (for instance methods). In a constructor
we cannot actually pass the receiver (it is not yet initialized), so we pass null
and the agent uses a JNI GetLocalObject call to retrieve the receiver from
the stack frame.
Exception throw: At an athrow bytecode we insert a call that passes the exception
object, the method id, and the receiver (for instance methods). The same
special handling of the receiver in constructors happens here, too.
Exception exit: To detect exceptional exit of a method, we wrap each method’s
original bytecode with a catch-anything exception handler, which makes a
call indicating the same information as for a throw, and then re-throws the
exception.
Exception handle: At the start of each exception handler we insert a call that notes
the same information as for a throw.
Scalar object allocations: The basic idea is to insert, after the new bytecode, a call
that indicates the new object, its class, and the allocation site number. How-
ever, we cannot pass the new object directly since it is uninitialized. Further,
it is on the JVM stack, not in a local variable, so the JNI GetLocalObject
function will not work. Our solution is to add one extra local variable to any
method that allocates a scalar. The instrumented code dups the new object
reference and astores it to the extra local. In the call to the agent, the in-
strumented code indicates which local variable the agent should examine to
obtain the object reference. Strictly speaking, we do not need to pass the
class, since the agent can figure it out; we may remove that in the future.
Array allocations: New arrays start life fully initialized, so we simply pass them in
a call to the agent, along with the allocation site number. For multianewarray
we call an out-of-line instrumentation routine that informs the agent of
each of the new arrays that are created. This could also be
done in the agent, if desired.
Pointer updates: For putfield of a reference and for aastore we insert, before
the bytecode, a call that notes the object being changed, the object reference
being stored, and the field (or index, for an array) being updated. Java allows
putfield on uninitialized objects (mostly so that an instance of an inner class
can have its pointer to its “containing” outer class instance installed, before
invoking the inner class constructor). In that case we use the same technique
as for scalar allocations to indicate to the agent the object being updated.
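At the source level, the effect of this pointer-update instrumentation can be pictured as follows. The agent entry point, its signature, and the field id are stand-in names for illustration; real ET inserts the call at the bytecode level, before the putfield itself.

```java
// Source-level picture of the putfield instrumentation described above.
// Agent.pointerUpdate and NEXT_FIELD_ID are invented names; the real
// agent call is injected into the bytecode by the instrumenter.
class Agent {
    static void pointerUpdate(Object source, Object target, int fieldId) {
        // real agent: look up serial numbers and emit a pointer-update record
    }
}

class Example {
    Example next;
    static final int NEXT_FIELD_ID = 17;   // assigned by the instrumenter

    void setNext(Example n) {
        // inserted call, before the actual store:
        Agent.pointerUpdate(this, n, NEXT_FIELD_ID);
        this.next = n;                     // the original putfield
    }
}
```

Making the call before the store matters: the agent can still see the old target of the field in its shadow heap, and can timestamp it as having lost an incoming pointer.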
Uses of objects: As mentioned in Section 3.4, ET timestamps objects when they
are used. We mentioned there the cases in which that happens. We simply
insert a call, passing the object to be timestamped. On method entry we times-
tamp all pointer arguments, including the receiver. (In constructors we make
a slightly different call since we cannot pass the receiver; the agent fetches it
out of the frame.) For efficiency on method entry, we have timestamp calls
that take 2, 3, 4, or 5 objects to stamp.
Counts: As an extension controlled by a command line flag, the instrumenter will
also track the number of heap reads and writes, the number of heap reads
and writes of reference values, and the number of bytecodes executed, and
insert calls reporting these just before each action that advances the timestamp
clock, and just before control flow branch and merge points.
The instrumenter also includes a simple kind of optimization to reduce the number of timestamp calls. It tracks which variables (locals and stack) have been timestamped since the last tick or the last bytecode frame object. (Frames occur at control
flow merge points, and detail the types of the local and stack variables at that point.)
It avoids timestamping an object twice in the same tick. This optimization requires
tracking object references as bytecodes move them around, but is straightforward.
The optimization is effective, and necessary in order to avoid having some methods
increase in size so much, because of added instrumentation, that they overflow the
maximum allowed method size. We present the results in Table 3.1 on page 34.
The average enlargement factor (as measured by increase in byte size) of a class due to injected instrumentation was 1.47 without optimization, and 1.38 with optimization. batik was unable to run to completion without this optimization, as the number of bytecodes exceeded the number allowed in a Java method. The optimization is particularly effective on methods that initialize large arrays.
Benchmark   no opt    opt
avrora        1.46   1.34
batik            -   1.31
eclipse       1.38   1.27
fop           1.47   1.37
h2            1.49   1.48
jython        1.58   1.52
luindex       1.48   1.38
lusearch      1.48   1.48
pmd           1.45   1.53
sunflow       1.48   1.38
tomcat        1.43   1.40
xalan         1.46   1.37
geomean       1.47   1.38
Table 3.1: The average class enlargement for each of the DaCapo benchmarks (as measured by increase in size in bytes), with and without the optimizations described in Section 3.5.2 on page 31. batik does not run to completion in the no-opt configuration, since it violates the maximum Java method size.
Instrumentation special cases
We now detail various special cases (beyond access to uninitialized new objects, mentioned in the previous section). These are cases where VMs do something that bypasses the bytecode in some way that is difficult to observe. Elephant Tracks is as VM independent as possible, but these parts might have to be reimplemented on different Java VMs (although those we examined, Oracle's HotSpot and IBM's J9 Java 1.6 JVMs, both behave in a similar way). If we did not handle these cases, and a program happened to use these features, the trace would be inaccurate in some detail: either missing a pointer update or an allocation, or recording incorrect death times for some objects.
Native methods: In order to indicate when a native method is called and re-
turns, we change its name, prepending $$ET$$, and insert a non-native method that
calls the native method. We instrument the non-native method essentially as usual.
A number of native methods require special treatment, however:
getStackClass: This method of java.lang.Class, and several similar meth-
ods, include an argument specifying the number of stack frames to go up to
look for some information. To wrap these native methods, the ET non-native
wrapper adds one to the number of frames before calling the native. This
properly adjusts for the extra level of call that the wrapper adds.
getClassContext: This method of the IBM J9 ClassLoader probes a specific
number of frames up the stack, so wrapping it disturbs the result. With
regret, we do not wrap it. (We contend that native methods subject to this
problem should be redesigned, like getStackClass described above, so that
they can be wrapped.) A number of other methods exhibit essentially the
same problem.
Several native methods of class Object: Specifically, getClass, notify, notifyAll,
and wait do not operate correctly if wrapped, so we omit them.
initReferenceImpl: This method of class Reference initializes the referent
field of a weak reference object. We instrument it specially so that the agent
can observe the update to the field, which otherwise would be hidden to ET.
Several methods of sun.misc.Unsafe: for allocateInstance we note the al-
location; for a successful compareAndSwapObject we note the pointer up-
date, as we do for putObject, putObjectVolatile, and putObjectOrdered.
All of these updating operations work in terms of the offset of a field or array
element into the object, a fact not readily available to the agent. Therefore we
instrument objectFieldOffset, staticFieldBase, staticFieldOffset,
arrayBaseOffset, and arrayBaseScale to inform the agent of the base or
offset information they return, so that the agent can map the offsets and bases
back to fields and array elements.
System.arraycopy: We instrument this specially so that the agent can note all
the resulting updates to arrays of objects. The agent does the actual work and
notes the effects, taking care to deal correctly with situations that will throw
an exception, etc.
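A simplified model of this special case, with invented names: each reference slot copied by arraycopy is reported as one pointer update.

```java
// Simplified model of how the agent accounts for System.arraycopy on
// reference arrays: each copied slot is reported as a pointer update.
// noteArrayUpdate is a stand-in for the real trace-record emission.
class ArrayCopyModel {
    static int updatesNoted = 0;

    static void noteArrayUpdate(Object[] dst, int index, Object newValue) {
        updatesNoted++;   // real agent: emit an update record for dst[index]
    }

    static void trackedArraycopy(Object[] src, int srcPos,
                                 Object[] dst, int dstPos, int len) {
        System.arraycopy(src, srcPos, dst, dstPos, len);  // may throw, as usual
        for (int i = 0; i < len; i++) {
            noteArrayUpdate(dst, dstPos + i, dst[dstPos + i]);
        }
    }
}
```

The real agent must also handle the exceptional cases (bounds errors, store checks) so that only the slots actually written are recorded; this sketch relies on arraycopy throwing before any update is noted.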
Class Object: We instrument Object.<init>, but not Object's support for
finalization (which breaks JVMs). Any finalize method in another class is instrumented, so finalizations are visible in the trace.
Timestamping new objects: Trying to obtain a reference to and timestamp a new object in Object.<init> does not work, so new objects are instead timestamped at their allocation sites, as described above.
3.5.3 The agent
The agent performs these functions to support ET’s goals:
• Sends classes to the instrumenter and returns instrumented classes to the
JVM.
• Notes several actions of the JVM and responds appropriately. These include:
changes in the JVMTI phase of execution (VMStart, VMInit, and VMDeath);
GarbageCollectionFinish, which triggers a scan (described further be-
low) to see if any weak references have been cleared; and VMObjectAlloc,
to detect objects allocated directly by the VM.
• Intercepts various JNI calls so that it can emit suitable trace records, specif-
ically, AllocObject, ThrowNew, and the various NewObject and NewString
calls, to note the new object; and SetObjectField, SetStaticObjectField,
and SetObjectArrayElement to note reference updates.
• Handles the various instrumentation calls from the ElephantTracks class
and (generally) creates a trace record, inserting it into a buffer. The size of
this buffer is a configurable runtime parameter, specified by the number of
records it will hold. We find that sizes in the hundreds of thousands work well in
practice.
• Maintains a model of the heap graph called the Shadow Heap. Each node represents an object and each directed edge a pointer. The model also includes static variables, but does not (cannot) include various VM internal roots, and as previously described, we do not model stack roots directly, but employ timestamping to determine liveness from thread stacks. The Shadow Heap is represented in the agent's memory, not the Java heap, so it cannot cause memory pressure that would trigger additional garbage collections.
• To help maintain the heap graph model and to identify objects in trace records,
the agent uses the JVMTI object tagging facility to associate a unique serial
number with each object, as early as possible after the object is created.
• Maintains a table of object liveness timestamps, and the timestamp “tick”
clock.
• Maintains a data structure describing weak objects and their referents. Whenever the JVM runs its garbage collector, after collection completes the agent
notifies a separate agent thread to check each weak object to see if its referent
has been cleared. This thread will timestamp the now-unreachable referent
with the current time, giving a good-faith estimate as to when it died.
Trace outputting proceeds in cycles. This is because determining which objects have died, propagating timestamps, and inserting death records at the right place in the trace, is a periodic activity, done in batches. When the agent is notified that the
JVM is entering the JVMTI Live phase, the agent iterates over the initial heap and creates an object allocation record for each object and a pointer update record for each non-null instance and static field. When the agent is notified that the JVM is entering the JVMTI Dead phase (JVM shutdown), it closes out the current buffer of trace records.
In between, during the Live phase, whenever the trace buffer fills with records, the agent:
1. Forces a garbage collection and then iterates over the remaining heap. This
allows the agent to detect which objects have been reclaimed since the trace
buffer was last emptied.
2. Applies the Merlin algorithm to compute object death times (really “last time
alive” times).
3. Checks weak objects to see if their referent has been cleared. The VM does not inform the agent directly about this, but since we note referent field initialization, we know about weak objects and their referent targets. The tables
the agent maintains for these are carefully designed not to keep the objects
live (we use JNI weak references).
4. Adds death records to the trace buffer, properly timestamped.
5. Sorts the records using a stable sort, and outputs them.
We observed that the last step, sorting and outputting, consumes about half the time of creating a trace, so we developed a parallel version. We report performance results in Section 3.6.
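The ordering requirement in the last step can be illustrated with a minimal sketch. The record fields below are simplified stand-ins for ET's actual record format; the point is that death records inserted late sort to their timestamps, while records sharing a timestamp keep their original buffer order, which only a stable sort guarantees.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Simplified stand-in for an ET trace record: a method-time timestamp
// and a record kind. Real ET records carry much more information.
public class TraceSorter {
    static final class Record {
        final long time;
        final String kind;
        Record(long time, String kind) { this.time = time; this.kind = kind; }
    }

    // List.sort is guaranteed stable, so records with equal timestamps
    // retain the order in which they entered the buffer.
    static List<Record> sortBuffer(List<Record> buffer) {
        List<Record> sorted = new ArrayList<>(buffer);
        sorted.sort(Comparator.comparingLong(r -> r.time));
        return sorted;
    }

    public static void main(String[] args) {
        List<Record> buffer = new ArrayList<>();
        buffer.add(new Record(2, "death"));   // death record added late, for time 2
        buffer.add(new Record(1, "alloc"));
        buffer.add(new Record(2, "update"));  // same timestamp: must stay after "death"
        for (Record r : sortBuffer(buffer))
            System.out.println(r.time + " " + r.kind);
    }
}
```

Because the sort is stable, the two records at time 2 keep their relative buffer order after the allocation record moves ahead of both.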
3.5.4 Properties of our implementation approach
Our implementation strategy has many advantages and few drawbacks. Its advantages include:
• It works with commercial JVMs (in principle with any JVM supporting JVMTI; we have tested it with the Oracle HotSpot and IBM J9 1.6 JVMs) and can run any application. Of course, timing-dependent applications may misbehave, as with any tool that slows execution; this prevents several DaCapo benchmarks from completing successfully.
• The run-time is implemented in C++, with all of its data structures outside of the JVM. This makes it easier to ensure that ET data structures and actions are not inappropriately entwined with the application and JVM.
• The instrumenter is in a separate process, ensuring it does not recursively instrument itself. As discussed previously, our reliance on ASM is not problematic because ASM is widely used, actively maintained, and part of the infrastructure of at least one major commercial JVM (Oracle's HotSpot).
• We capture even some very tricky cases, including weak references, field updates via sun.misc.Unsafe, reflective object creation, updates, and method calls, VM internal allocations, relevant JNI calls made by the VM or other native libraries, and System.arraycopy.
Drawbacks of ET as it stands are mostly ones that similar tools are likely to share:
• A few methods cannot be instrumented, since doing so breaks the JVM.
• Relative timing and thread interactions are affected, which may change application behavior.
• Weak reference clearing is dependent on the vagaries of the JVM.
• Precision in determining object deaths, and the general wealth of information
in the traces, come at a cost: the execution dilation factor is on the order of
hundreds (see Section 3.6 for performance results).
• The resulting system is not as simple as we would like. There are places with
somewhat tricky synchronization and more data structures and mappings than
we would like, but it is not easy to deal with features such as weak references
and sun.misc.Unsafe.
• We rely heavily on correctness and completeness of JVMTI and JNI support.
One implication is that, at present, JikesRVM cannot support ET. Also, we
have discovered previously unreported JVM bugs, such as the failure of one JVM to present every class it loads for rewriting, which implies that a handful of
classes go uninstrumented. (That bug is being fixed, but it appears we later
found a similar case whose fix will take longer.)
3.6 Results
3.6.1 Performance
In this section we present results from running Elephant Tracks on the DaCapo benchmarks in order to give a sense of its performance and the properties of the resulting traces. (Unfortunately, it fails to run tradebeans and tradesoap, perhaps because of internal timeouts.) In Table 3.2 on page 45 we present the run-time overhead of our tool under several configurations:
In the No Callback configuration, all of our bytecode instrumentation was injected, but callbacks into the JVMTI agent were disabled (resulting in an empty trace). Additionally, the No Callback configuration enables only the absolute minimum number of JVMTI features necessary to instrument classes. This represents a practical lower bound on the overhead of instrumenting class files and executing the instrumented bytecode, without the overhead of calling into the JVMTI agent, processing the events, or producing a trace record.
The Serial ET configuration periodically pauses to generate death records, put them in order, and output them to the trace file. In contrast, the Parallel ET configuration spawns a separate thread to do this work. This generally results in better performance and fewer pauses in the traced application, but may be of no benefit if the machine lacks sufficient resources to execute this thread in parallel with the application.
With a geometric mean of about 250, the overall dilation factor of Elephant
Tracks is within a factor of two of the published dilation factors of GCTrace (Hertz et al., 2002, 2006), while providing much more information.
The dilation factors of the different benchmarks are not uniform. This diversity cannot be explained only by differences in the amount of instrumentation, since there is no simple linear relationship between the No Callback configuration and the other configurations. Similarly, it cannot be explained by a simple linear model relating the number of calls into the JVMTI agent and/or the average heap size of the benchmark being traced (at least, no such model we were able to discover). Therefore, we theorize that it relates to complex interactions between our instrumentation, Java optimizations, and/or the implementation details of the JVMTI interface.
3.6.2 Trace analysis
Table 3.3 on page 46 shows the composition of the traces by event type (percentage of the trace accounted for by each type). Method entry and exit events outnumber the others significantly, which is why method time is so much more precise than allocation time. In fact, on average there are 70 method entry/exit events between any two allocations. In other words, a single tick of the allocation clock can span dozens of methods, making it difficult to localize object death events within the code. A single tick of the method time clock occasionally contains an allocation, and depending on where the starting and ending method events are found, we might not be able to tell if a death event occurred before or after the allocation. In a few very rare cases, a single unit of method time might contain multiple allocations, but they would have to be arrays, since regular objects always have a constructor call.
In order to demonstrate the value of these more precise traces, we present a few simple trace analysis examples. First, a simple escape analysis is easy to perform with ET traces. We process a trace, and upon encountering a record of object allocation, note the context in which it occurred. Then, if the death event for that same object is encountered before the associated method return, we know the object has not escaped. Conversely, if we do not find the death record before this point, the object has escaped. Note that this does not necessarily mean that there is any static analysis that could have determined in advance that the object would or would not have escaped. The results of this escape analysis are reported in Table 3.5 on page 47, where we see that in most benchmarks a majority of objects escape their allocating context.
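This escape analysis can be sketched over a simplified trace. The four record shapes below are illustrative, not ET's actual trace format; an object escapes exactly when its death record is not seen before the exit of the method invocation that allocated it.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class EscapeAnalysis {
    // Trace records: {"enter", m}, {"exit", m}, {"alloc", obj}, {"death", obj}.
    // Returns obj -> true if it escaped its allocating method invocation.
    static Map<String, Boolean> escaped(List<String[]> trace) {
        Deque<Set<String>> frames = new ArrayDeque<>();      // objects allocated per live frame
        Map<String, Set<String>> allocFrame = new HashMap<>();
        Map<String, Boolean> result = new HashMap<>();
        for (String[] rec : trace) {
            switch (rec[0]) {
                case "enter":
                    frames.push(new HashSet<>());
                    break;
                case "alloc":
                    frames.peek().add(rec[1]);
                    allocFrame.put(rec[1], frames.peek());
                    break;
                case "death": {
                    // Did the object die while its allocating frame was still live?
                    Set<String> f = allocFrame.get(rec[1]);
                    result.put(rec[1], f == null || !f.remove(rec[1]));
                    break;
                }
                case "exit": {
                    Set<String> frame = frames.pop();
                    for (String obj : frame) result.put(obj, true); // outlived its frame
                    frame.clear();  // a later death record must not flip the verdict
                    break;
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String[]> trace = new ArrayList<>();
        trace.add(new String[]{"enter", "m1"});
        trace.add(new String[]{"alloc", "a"});
        trace.add(new String[]{"enter", "m2"});
        trace.add(new String[]{"alloc", "b"});
        trace.add(new String[]{"exit", "m2"});   // b has no death record yet: it escapes
        trace.add(new String[]{"death", "a"});   // a dies inside m1: it does not escape
        trace.add(new String[]{"exit", "m1"});
        System.out.println(escaped(trace));      // a -> false, b -> true
    }
}
```

A single pass over the trace suffices, which is one reason this style of analysis is cheap compared with re-running the program under a static analyzer.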
Second, previous work has shown that the allocation site plus some calling context is a good basis for predicting object lifetime, measured in bytes of allocation (Jones and Ryder, 2008). Since Elephant Tracks' traces can also provide the calling context of an object's death, it is possible to consider whether the allocation context is also a predictor of death context.
As a preliminary investigation, we performed the following analysis. Each object's allocation is recorded with a triple, consisting of the allocation site, the allocating method, and the caller of the allocating method (this gives us a partial calling context). Next, the analysis finds the top ten allocation contexts (based on number of objects allocated). For each object allocated in these contexts, it determines the object's partial death context (there is no site for a death event, so we consider only the method in which it died, and the calling method). Finally, the analysis finds, for each allocation context, the most common death context for objects with that allocation context. In Table 3.6 on page 48 we report the average percentage of objects allocated in the top ten contexts that die in the plurality context.
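The plurality computation in this analysis can be sketched as follows; contexts are opaque strings here, standing in for the (site, method, caller) triples just described.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DeathContextStats {
    // pairs: {allocContext, deathContext}, one per object.
    // Returns the fraction of objects from allocCtx that die in the
    // plurality (most common) death context for that allocation context.
    static double pluralityFraction(List<String[]> pairs, String allocCtx) {
        Map<String, Integer> deathCounts = new HashMap<>();
        int total = 0;
        for (String[] p : pairs) {
            if (!p[0].equals(allocCtx)) continue;
            total++;
            deathCounts.merge(p[1], 1, Integer::sum);
        }
        int plurality = 0;
        for (int c : deathCounts.values()) plurality = Math.max(plurality, c);
        return total == 0 ? 0.0 : (double) plurality / total;
    }

    public static void main(String[] args) {
        List<String[]> pairs = new ArrayList<>();
        pairs.add(new String[]{"A", "X"});
        pairs.add(new String[]{"A", "X"});
        pairs.add(new String[]{"A", "Y"});
        pairs.add(new String[]{"B", "Z"});
        // Two of the three A-objects die in context X, the plurality context.
        System.out.println(pluralityFraction(pairs, "A"));
    }
}
```

Averaging this fraction over the top ten allocation contexts yields the numbers reported in Table 3.6.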
This initial work suggests that the death context of an object may be a stable and predictable feature. However, additional refinement will be required to further illuminate the relationship between an object's allocation context and its death context, as well as to determine if this relationship can be exploited for any optimization.
3.7 Conclusions
In this chapter, we have presented Elephant Tracks, a tool for efficiently generating program traces including accurate object death records. Unlike previous tools, Elephant Tracks traces allow recorded events to be placed in the context of the methods of the program being traced. It also works independently of any particular choice of the JVM to which it attaches. These two features will allow the prototyping of new GC algorithms and new kinds of program analysis, and let the tool keep up with changes in the JVM and class libraries. Elephant Tracks offers performance comparable to similar previous tools, but provides far more information and covers many more of the tricky and corner cases of Java and Java virtual machines.
In the following chapter, we will see how the information produced by Elephant
Tracks can be used to inform garbage collection strategies.
Benchmark          No Callback   Serial ET   Parallel ET
                   Dilation      Dilation    Dilation
avrora-default         1.6         436.5       291.1
avrora-large           0.9         553.9       436.1
avrora-small           1.7         354.0       227.3
batik-default          3.7         152.5       102.6
batik-large            3.0         124.6        84.1
batik-small            2.9          58.3        41.9
eclipse-default       18.0         310.5      2110.5
eclipse-large         19.3         498.6      1603.6
eclipse-small         50.5          47.6      4297.5
fop-default            2.6         181.2       130.2
fop-small              2.8          42.0        30.7
h2-default             4.4        3137.3      3245.8
h2-large               4.3        2652.7      3583.1
h2-small               3.1        1272.9      1038.7
h2-tiny                3.9         947.5       754.7
jython-default         2.0         342.2       235.0
jython-large           2.5         949.4       774.9
jython-small           1.6          93.1        71.5
luindex-default        1.7          88.4        71.6
luindex-small          1.6           5.8         4.4
lusearch-default       2.7         385.7       304.3
lusearch-large         2.8         451.8       327.9
lusearch-small         2.9         112.5        85.7
pmd-default            2.0         276.1       134.6
pmd-large              2.4         549.1       230.2
pmd-small              1.8           7.4         5.6
sunflow-default        5.9        1830.1      1457.8
sunflow-large          6.4        2073.2      1583.3
sunflow-small          6.9         598.1       481.2
tomcat-default         1.8         100.3        72.3
tomcat-large           1.7         240.0       175.4
tomcat-small           1.8          48.4        38.4
xalan-default          3.0         482.0       372.5
xalan-large            3.7         922.6       715.4
xalan-small            2.8         114.8        95.5
geomean                3.2         257.7       245.8
Table 3.2: Run-time overhead for Elephant Tracks on the DaCapo benchmark suite
Benchmark          Method   Alloc     Catch     Pointer
                            + Death   + Throw   Update
avrora-default      97.97    0.30      0.00      1.74
avrora-large        95.17    0.24      0.00      4.59
avrora-small        97.85    0.30      0.00      1.85
batik-default       92.01    1.70      0.00      6.29
batik-large         92.34    1.77      0.00      5.89
batik-small         92.55    2.55      0.00      4.89
eclipse-default     88.65    4.29      0.00      7.06
eclipse-large       90.24    3.23      0.00      6.52
eclipse-small       93.67    3.27      0.01      3.05
fop-default         90.30    4.63      0.00      5.07
fop-small           89.83    4.21      0.00      5.95
h2-default          94.23    2.86      0.00      2.90
h2-large            94.92    2.02      0.00      3.06
h2-small            94.05    3.04      0.00      2.91
h2-tiny             93.98    3.11      0.00      2.91
jython-default      91.64    3.08      0.02      5.26
jython-large        91.74    2.79      0.01      5.46
jython-small        97.36    1.11      0.00      1.53
luindex-default     96.37    0.28      0.00      3.36
luindex-large       90.49    5.61      0.06      3.84
luindex-small       94.91    1.36      0.00      3.72
lusearch-default    91.70    2.73      0.05      5.52
lusearch-large      91.70    2.72      0.05      5.52
lusearch-small      91.77    2.74      0.05      5.43
pmd-default         87.31    3.86      0.10      8.73
pmd-large           87.04    4.27      0.09      8.60
pmd-small           92.54    3.48      0.01      3.97
sunflow-default     94.84    3.00      0.00      2.16
sunflow-large       94.83    3.00      0.00      2.17
sunflow-small       94.90    2.96      0.00      2.14
tomcat-default      89.92    5.92      0.02      4.14
tomcat-large        90.47    6.33      0.01      3.19
tomcat-small        89.00    5.28      0.03      5.70
xalan-default       94.10    1.54      0.00      4.37
xalan-large         94.11    1.52      0.00      4.37
xalan-small         93.99    1.65      0.00      4.36
mean                92.74    2.85      0.01      4.40
Table 3.3: Percentage of each record type in traces of the DaCapo benchmark suite
Benchmark   Shadow Heap (MB)   Program Heap (MB)   Shadow Heap/Program Heap
avrora            1.78               2.94                 0.60
batik             9.03              13.06                 0.69
eclipse          22.68              28.49                 0.80
fop               5.90              16.56                 0.36
jython           14.27              29.43                 0.48
luindex           7.20               5.10                 1.41
sunflow           3.83               9.94                 0.39
tomcat           16.11              12.50                 1.28
xalan             5.38               6.92                 0.78
mean                                                      0.73
Table 3.4: Shadow heap memory usage compared with the size of the heap.
Benchmark         % Escaping   Benchmark          % Escaping
avrora-default      83.53      luindex-default      54.14
avrora-large        79.39      luindex-small        46.25
avrora-small        87.41      lusearch-default     39.98
batik-default       63.79      lusearch-large       40.00
batik-large         62.97      lusearch-small       40.02
batik-small         62.22      pmd-default          53.68
eclipse-default     32.32      pmd-large            52.66
eclipse-large       41.97      pmd-small            51.78
eclipse-small       26.85      sunflow-default      68.63
fop-default         55.25      sunflow-large        68.49
fop-small           65.06      sunflow-small        68.38
h2-default          58.03      tomcat-default       25.44
h2-large            58.24      tomcat-large         21.87
h2-small            58.00      tomcat-small         32.44
h2-tiny             57.82      xalan-default        54.99
jython-default      42.13      xalan-large          55.16
jython-large        42.95      xalan-small          53.59
jython-small        68.02
Table 3.5: Percentage of objects escaping their allocating context in the DaCapo benchmark suite
Benchmark         Mean %   Benchmark          Mean %
avrora-default     41.22    luindex-small       76.57
avrora-large       44.38    lusearch-default    83.39
avrora-small       30.06    lusearch-large      79.63
batik-default      63.05    lusearch-small      83.37
batik-large        59.64    pmd-default         47.82
batik-small        75.08    pmd-large           34.98
fop-default        81.24    pmd-small           57.70
fop-small          64.54    sunflow-default     86.12
h2-default         86.37    sunflow-large       80.12
h2-large           88.09    sunflow-small       89.54
h2-small           86.78    tomcat-default      75.80
jython-default     68.79    tomcat-large        72.15
jython-large       74.03    tomcat-small        79.32
jython-small       74.15    xalan-default       71.48
luindex-default    78.45    xalan-large         71.55
Table 3.6: Mean percentage of objects allocated in the top ten allocation contexts that die in that context's plurality death context.
Chapter 4
Deferred Collector
The goal of the deferred collector is to avoid repeated work in the marking phase of two successive garbage collections. Recall from Chapter 2 that each time a full heap garbage collection is performed (excluding nursery collections from consideration), the collector visits all live objects. However, if a section of the heap is unchanged, this represents repeated work. If, during its traversal, the collector could know it was entering a subgraph that had not changed since its last traversal, it could avoid this repeated work.
The deferred collector takes advantage of this by taking a hint from a programmer about a key object. The key object governs the lifetime of other objects, and the fact that it is alive indicates that other objects reachable from it are alive. In this chapter, we will describe how we can use the traces generated by Elephant Tracks to find key objects, the implementation of a collector that exploits this knowledge, and the result of using this collector on a program.
4.1 Finding Key Objects
We would like to reduce the workload of the garbage collector, and we now have a lot of data in hand with which to attempt it. But the question is what to look for in the data. One idea, proposed by Hayes (1991), is to find key objects. Key objects are those whose liveness governs the liveness of other objects in the heap; if the key object is alive, we can assume the governed objects are alive.
How can we discover key objects with the data available in an Elephant Tracks trace? In principle, programs can tie the lifetimes of objects together in arbitrary ways, and analyzing the program source to determine this would be difficult or impossible. However, since Elephant Tracks is a dynamic analysis tool, there is a runtime signal we can use: If an object A is reachable, then any object it refers to must also be reachable. Imagine an object A such that a large number of objects are reachable only via A (that is, A dominates them). As long as A is reachable, and the
relationship holds, we know that all the objects dominated by A must be alive.
Therefore, what we want to find are large groups of objects dominated by a
relatively small set of objects. The dominating objects can potentially serve as our
key objects, and we can avoid tracing the dominated objects. Since a generational
collector already avoids tracing short lived objects, we only want to consider objects
that live a long time. We therefore only consider objects that live for longer than
fifty percent of the program's lifetime; such objects will be visited by many garbage
collections. We also are not interested in objects that dominate only a few other
objects, so we will only consider groups of at least 10,000 objects (a threshold
chosen because it is at least a few percent of the live size of all programs studied)
dominated by some smaller set of objects.
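The selection criteria above can be encoded directly. In the sketch below, the Candidate class is a hypothetical summary of what the trace analysis computes per object: the fraction of the run for which the object is live, and the size of the group it dominates.

```java
import java.util.ArrayList;
import java.util.List;

public class KeyObjectFilter {
    static final double MIN_LIFETIME_FRACTION = 0.5; // alive > 50% of the run
    static final int MIN_GROUP_SIZE = 10_000;        // dominates >= 10,000 objects

    static final class Candidate {
        final String id;
        final double lifetimeFraction; // fraction of the program's run it is live
        final int dominatedCount;      // objects reachable only through it
        Candidate(String id, double lifetimeFraction, int dominatedCount) {
            this.id = id;
            this.lifetimeFraction = lifetimeFraction;
            this.dominatedCount = dominatedCount;
        }
    }

    static List<Candidate> keyCandidates(List<Candidate> all) {
        List<Candidate> keep = new ArrayList<>();
        for (Candidate c : all)
            if (c.lifetimeFraction > MIN_LIFETIME_FRACTION
                    && c.dominatedCount >= MIN_GROUP_SIZE)
                keep.add(c);
        return keep;
    }

    public static void main(String[] args) {
        List<Candidate> all = new ArrayList<>();
        all.add(new Candidate("table", 0.9, 120_000)); // long-lived, big group: kept
        all.add(new Candidate("temp", 0.1, 120_000));  // short-lived: rejected
        all.add(new Candidate("node", 0.9, 12));       // too few dominated: rejected
        for (Candidate c : keyCandidates(all)) System.out.println(c.id);
    }
}
```

The two thresholds correspond to the two observations in the text: short-lived objects are already handled cheaply by the generational nursery, and small dominated groups do not repay the bookkeeping.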
We identified these objects by processing the traces from Elephant Tracks. The
resulting data for the DaCapo benchmarks (Blackburn et al., 2006) are shown in
[Figure 4.1, panels (a) avrora, (b) batik, (c) eclipse, (d) fop, (e) h2, (f) jython]
Figure 4.1: The live size of each of the DaCapo benchmarks with respect to time. The blue line shows the number of objects live over time. The lower green dashed line shows the number of live objects that fall into clusters as described in Section 4.1.
[Figure 4.1 (continued), panels (g) luindex, (h) lusearch, (i) pmd, (j) sunflow]
figures 4.1a through 4.1m. In these figures, the blue line on top shows the number of objects live at a given time, and the green dashed line shows the number that fall into the previously described clusters. Although we have previously discussed the utility of measuring time in terms of methods instead of bytes allocated, for coarse-grained measurements such as this there is little difference. Measuring with method time would simply add more ticks to the X axis; therefore we use bytes allocated to be more consistent with existing work measuring the live sizes of DaCapo benchmarks. Also note that because some benchmarks have live sizes that are orders of magnitude higher than others, and some allocate orders of magnitude more than others, the axes of these figures are not all on the same scale. We see that in
[Figure 4.1 (continued), panels (k) tomcat, (l) tradesoap, (m) xalan]
some benchmarks, such as pmd, few such objects exist, but in others (such as h2) many objects meet our criteria. However, even in pmd such objects number roughly 100,000 out of a heap of 700,000 objects, representing a significant potential for savings. The difficulty will be in producing a garbage collector that avoids tracing these objects, while still collecting all (or at least most) of the dead objects. If too many dead objects go uncollected, the result will be additional garbage collections, eliminating any potential savings in time.
4.2 Deferred Collector Design
Many objects in the heap of a program do not change at all between two successive garbage collections. However, in most garbage collectors these objects will be repeatedly traced, despite the fact that the result of this tracing has already been computed and yielded no garbage. To avoid this repeated work, we have created the Deferred Collector, which relies on hints from a programmer to specify key objects. Intuitively, the key object is the root of some data structure that the programmer expects to change only infrequently. Objects within that data structure should not be traced on each garbage collection, but rather with some lower frequency.
In order to make this happen, we add two extra bits to the header of each object. The first bit, the key bit, is set on any of the key objects provided by the user. When an object has its mark bit set, we say it is marked. Analogously, when an object has the key bit set, we say it is keyed. The second bit, called the deferred bit, is set on those objects we wish to skip when tracing the heap. We will refer to objects with the deferred bit set as deferred. The deferred bit acts something like a durable mark bit: when tracing the heap during the mark phase, the collector will not follow edges pointing to deferred objects.
The key objects are specified by the programmer, using an API consisting of a single method, traceInfrequently. The mutator program gives a hint for a key object simply by passing the object as a parameter to the traceInfrequently method. We discuss some alternative interfaces in Section 4.3.1.
4.2.1 Defer All Objects Reachable from the Key
We modified the basic garbage collection phases as follows.
1. A memory threshold is reached and the mutator is paused.
2. The collector sets the key bit on all key objects (given by programmer hint).
3. The Deferral phase begins. All objects reachable from the key objects are deferred.
4. The collector enumerates all roots (references on the stack, or references held in global variables, that refer to objects in the heap).
5. The Mark phase begins. All objects reachable via the roots on a path that does not go through a keyed object or deferred object are traversed and marked.
6. The Sweep phase begins. The collector linearly scans through memory, checking to see if objects are marked. All objects not marked, deferred, or keyed are reclaimed ("swept").
7. The mutator resumes.
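Phases 2 through 6 can be sketched on an explicit object graph; the class and field names below are illustrative, not JikesRVM's. Deferral spreads from key objects, marking from roots stops at keyed or deferred nodes, and the sweep keeps anything marked, deferred, or keyed.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DeferredCollect {
    static final class Obj {
        boolean marked, deferred, key;
        final List<Obj> fields = new ArrayList<>();
    }

    // Deferral phase: everything reachable from a key object is deferred.
    static void defer(Obj o) {
        if (o.deferred) return;
        o.deferred = true;
        for (Obj f : o.fields) defer(f);
    }

    // Mark phase: stop at keyed or deferred objects instead of entering them.
    static void mark(Obj o) {
        if (o.marked || o.deferred || o.key) return;
        o.marked = true;
        for (Obj f : o.fields) mark(f);
    }

    // Sweep phase: returns the set of reclaimed objects.
    static Set<Obj> collect(List<Obj> roots, List<Obj> keys, List<Obj> heap) {
        for (Obj k : keys) { k.key = true; defer(k); }
        for (Obj r : roots) mark(r);
        Set<Obj> reclaimed = new HashSet<>();
        for (Obj o : heap)
            if (!o.marked && !o.deferred && !o.key) reclaimed.add(o);
        return reclaimed;
    }

    public static void main(String[] args) {
        Obj key = new Obj(), dominated = new Obj(), live = new Obj(),
            garbage = new Obj(), root = new Obj();
        key.fields.add(dominated);   // dominated is reachable only via the key
        root.fields.add(key);
        root.fields.add(live);
        List<Obj> heap = List.of(key, dominated, live, garbage, root);
        Set<Obj> reclaimed = collect(List.of(root), List.of(key), heap);
        System.out.println(reclaimed.size() + " object(s) reclaimed"); // only garbage
    }
}
```

Note that the mark phase never enters the dominated subgraph; that is precisely the work the deferred bit saves on subsequent collections.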
This algorithm maintains the invariant that after the collection is complete, all reachable objects are either marked or deferred. Since the sweeper will never reclaim a marked or deferred object, this guarantees correctness. It does not, however, guarantee that all deferred objects are reachable, so it is possible that some unreachable objects are not collected.
Although we will need to visit all objects reachable from a key object on the first collection dealing with that key, since the deferred bit persists between collections, we will not need to visit them on subsequent collections.
Maintaining these invariants requires changes to the write barrier in the mutator. Suppose that the mutator installs a pointer from a deferred object A to a non-deferred object B. At the next garbage collection, we cannot guarantee that A is reachable, so we have to assume that it is. Therefore, we will have to treat any such object B as a key object in the next collection. This is crucial to correctness, since otherwise B may never be traced, and may be collected while still alive.
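The barrier rule just described can be sketched as follows; class and field names are illustrative, not JikesRVM's.

```java
import java.util.ArrayList;
import java.util.List;

// A store that makes a deferred object point to a non-deferred object
// records the target; the collector then treats every recorded object
// as an extra key object at the next collection.
public class DeferredWriteBarrier {
    static final class Obj {
        boolean deferred;
        Obj field;
    }

    final List<Obj> nextCollectionKeys = new ArrayList<>();

    void writeField(Obj src, Obj value) {
        if (src.deferred && value != null && !value.deferred)
            nextCollectionKeys.add(value); // value may be live only via src
        src.field = value;                 // the actual store
    }

    public static void main(String[] args) {
        DeferredWriteBarrier barrier = new DeferredWriteBarrier();
        Obj a = new Obj();
        a.deferred = true;
        Obj b = new Obj();                 // not deferred
        barrier.writeField(a, b);          // deferred -> non-deferred: b recorded
        System.out.println(barrier.nextCollectionKeys.size());
    }
}
```

Only stores out of deferred objects pay this extra cost; stores between non-deferred objects go through the ordinary generational barrier unchanged.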
4.2.2 Large Object Space
Large objects in JikesRVM are handled with the Baker Treadmill algorithm (Baker, 1992). As this requires special consideration in our deferred collector scheme,
we will briefly describe it here, and then explain the modifications employed in the
deferred collector.
Within the large object space, there are two data structures used for the Baker
Treadmill algorithm: two doubly linked lists of object addresses, referred to as the to-space and the from-space treadmills. When a garbage collection occurs, tracing from roots begins as normal. When an object in the large object space is encountered, its linked list node in the from-space treadmill is moved to the to-space treadmill (note that the objects themselves are not moved in memory; the linked list node referring to them is instead moved from one treadmill to another).
Then, after the marking phase, all reachable objects in the large object space are on the to-space treadmill, and any objects remaining on the from-space treadmill must be unreachable. The collector then iterates over the from-space treadmill, freeing those objects whose nodes have not been moved. Then, the collector switches the roles of the from-space and to-space for the next collection.
Thus far we have described the existing large object space implementation.
However, a problem with this scheme in the context of the deferred collector is that the deferred collector is not guaranteed to visit each deferred object on every collection. If one of the deferred objects is in the large object space, it will therefore not be moved from the from-space treadmill to the to-space treadmill. To fix this, we slightly modify the sweeping phase: when the collector iterates through the from-space treadmill, it first checks whether each object is deferred, and if so puts it on the to-space treadmill.
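The modified sweep can be sketched as follows. Names are illustrative, and in the real treadmill, objects reached during marking have already been moved to to-space, so this scan sees only unreached objects.

```java
import java.util.ArrayList;
import java.util.List;

// Deferred large objects, which marking may never visit, are moved to the
// to-space treadmill instead of being freed.
public class TreadmillSweep {
    static final class LargeObj {
        final String name;
        boolean deferred;
        LargeObj(String name, boolean deferred) {
            this.name = name;
            this.deferred = deferred;
        }
    }

    // Returns the freed objects; deferred survivors are appended to
    // toSpace for the next collection cycle.
    static List<LargeObj> sweepFromSpace(List<LargeObj> fromSpace, List<LargeObj> toSpace) {
        List<LargeObj> freed = new ArrayList<>();
        for (LargeObj o : fromSpace) {
            if (o.deferred) toSpace.add(o); // keep: never traced, but deferred
            else freed.add(o);              // unreached and not deferred: reclaim
        }
        fromSpace.clear();
        return freed;
    }

    public static void main(String[] args) {
        List<LargeObj> from = new ArrayList<>();
        List<LargeObj> to = new ArrayList<>();
        from.add(new LargeObj("bitmap", true));   // deferred large object
        from.add(new LargeObj("scratch", false)); // genuinely dead
        List<LargeObj> freed = sweepFromSpace(from, to);
        System.out.println(freed.get(0).name + " freed; " + to.get(0).name + " kept");
    }
}
```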
4.3 Mitigating Bad Hints
If the programmer gives us a bad hint, while it is impossible to violate memory
safety, it is possible to degrade performance. To see why, suppose the programmer
selects a key object A that is short lived, and that this object points to many other
short lived objects. In this case, if we naively trace all the objects reachable from
A and defer them, A and the objects reachable from A will be kept resident in
memory long past the end of their lives. This would result in less memory headroom
available to the system, and more frequent garbage collections.
We have implemented several strategies to mitigate the damage that might be
caused by a bad hint. The first is very simple: every k collections (of the mature space), the JVM does an immediate (normal, non-deferred) collection. We call the value k the immediate collection period. This guarantees that all dead objects will eventually be collected (if the program eventually triggers enough collections). The VM used has only two generations: the nursery and the mature space. Deferral is not used at all for nursery objects.
A second strategy exploits the fact that the deferred collector is built on top of a generational garbage collector (see Chapter 2 for an overview). Recall that in a generational collector, new objects are allocated in a small, frequently collected space. If a key object supplied by the programmer is in the nursery, the collector does not immediately act on the hint. Rather, it waits until the key object survives a nursery collection. This means that a particularly bad hint may die and be collected in a nursery collection before it ever has the opportunity to cause harm.
Another strategy is to monitor mutations. Suppose that there is a pointer from a deferred object A to another object B. Then the mutator modifies this pointer so that it no longer refers to B. It is possible that in doing this, the mutator has caused B to become unreachable. However, the collector will not collect B since its deferred
bit will still be set. Fortunately, we can monitor such mutations in the mutator.
Since the deferred collector is implemented on top of a generational collector, there is already a write barrier in place for every pointer write. If the deferred collector observes more than a few (we use 10 as a default) such pointer updates, the next collection is an immediate collection (an immediate collection does not respect the deferred bit).
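A sketch of this counter, under the stated default threshold of 10; the class and method names are hypothetical.

```java
// The write barrier counts pointer updates that modify references out of
// deferred objects; once more than the threshold accumulate, the next
// collection is forced to be an immediate (non-deferred) one.
public class MutationMonitor {
    static final int THRESHOLD = 10;
    private int riskyUpdates = 0;

    void noteDeferredPointerUpdate() { riskyUpdates++; }

    // Called when a collection is about to start; resets after firing.
    boolean nextCollectionIsImmediate() {
        if (riskyUpdates > THRESHOLD) {
            riskyUpdates = 0;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        MutationMonitor m = new MutationMonitor();
        for (int i = 0; i < 11; i++) m.noteDeferredPointerUpdate();
        System.out.println(m.nextCollectionIsImmediate()); // true: threshold exceeded
    }
}
```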
4.3.1 Application Programming Interface
The interface determines how the programmer can give hints to the garbage collector about which objects are key objects. One option would be to provide annotations that let the programmer statically annotate certain elements of the program. For example, all objects returned by an annotated method could be treated as key objects, or all objects of a particular type, or those produced by an annotated constructor.
However, whether an object is a good key object may depend on a runtime condition. For example, a certain data structure may only be a good candidate based on user input. Therefore, instead of a static annotation, we want to make it possible for hints to be executed conditionally, in the same way as any statement in the program. Thus, the interface we expose is a single method named traceInfrequently. Objects passed to traceInfrequently are hints to the garbage collector.
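A hypothetical use of this interface is shown below. The traceInfrequently stub only records the hint, standing in for the VM-provided method; the point is that the hint is an ordinary statement and can therefore be guarded by a runtime condition, which a static annotation could not express.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HintExample {
    static final List<Object> hints = new ArrayList<>();

    // Stand-in for the VM-provided API; the real method would arrange for
    // the key bit to be set at the next opportunity.
    static void traceInfrequently(Object keyObject) {
        hints.add(keyObject);
    }

    public static void main(String[] args) {
        Map<String, byte[]> cache = new HashMap<>();
        boolean longRunning = args.length > 0 && args[0].equals("--server");
        // The hint depends on a runtime condition.
        if (longRunning)
            traceInfrequently(cache);
        System.out.println(hints.size() + " hint(s) given");
    }
}
```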
4.4 JikesRVM
In this section, we discuss some of the technical challenges of implementing and debugging a garbage collector, specifically on JikesRVM. The Jikes Research Virtual Machine (JikesRVM) (Alpern et al., 2005a) is a self-hosting Java Virtual Machine, written in Java. We chose JikesRVM because we had group expertise at Tufts, and because its memory management system is modular by design. Although the modularity of its memory management system offers considerable advantages, the fact that it is self-hosting makes it extremely challenging to debug and evaluate for several reasons. Firstly, any object allocated by JikesRVM resides in the same heap as the program being run. This is challenging for several reasons; for one, the size of JikesRVM itself (just booting it allocates approximately 100,000 objects, depending on the exact configuration used) can easily dominate the memory usage when examining a smaller program.
Secondly, this makes it very difficult to allocate memory for JikesRVM itself to use. For example, suppose a programmer wishes to maintain a list of machine words in the garbage collector. Since JikesRVM is written in Java, the most obvious choice would be to allocate an object of type int[] (an integer array). However, this object would be allocated in the heap the VM is currently trying to collect. This is not permitted, since it would violate the collector's invariants (new objects must not appear while it traverses all objects). Instead, one must use the low-level primitive operations JikesRVM exposes to grab a page of raw memory and write words into it. While arrays are fairly easy to handle in this manner, more complicated data structures can be quite tedious to implement. To implement a singly linked list, one would need a function that carves out enough memory from a page to hold each node, and another that, given a memory address, computes the offset at which the "next" field resides. Since everything is of the primitive type Address (as in a C program where everything is of type void*), the effectiveness of static type checking is dramatically reduced. Nothing from the standard library is available, since it all allocates Java-level objects.
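To make the tedium concrete, the following sketch builds such a list outside the VM: a ByteBuffer stands in for the raw page JikesRVM would hand out, and integer offsets stand in for the Address type. The node layout (a value word followed by a "next" word) is an assumption made for illustration.

```java
import java.nio.ByteBuffer;

// Simulates building a singly linked list of machine words in a raw
// page, the way a JikesRVM collector must: no Java objects, only a
// block of memory and hand-computed field offsets.
public class RawPageList {
    static final int NODE_SIZE = 16;   // 8-byte value + 8-byte "next"
    static final int NEXT_FIELD = 8;   // offset of the "next" field
    static final long NULL = -1;

    final ByteBuffer page = ByteBuffer.allocate(4096); // the "raw page"
    int bumpPointer = 0;               // next free byte in the page
    long head = NULL;

    // Carve a node out of the page and push it on the front of the list.
    void push(long value) {
        int node = bumpPointer;
        bumpPointer += NODE_SIZE;
        page.putLong(node, value);
        page.putLong(node + NEXT_FIELD, head);
        head = node;
    }

    // Traverse the list by chasing hand-computed "next" offsets.
    long sum() {
        long total = 0;
        for (long n = head; n != NULL; n = page.getLong((int) n + NEXT_FIELD)) {
            total += page.getLong((int) n);
        }
        return total;
    }
}
```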
Another consequence of JikesRVM being written in Java is that traditional debuggers (such as gdb) are not very helpful. While they can be attached to the JikesRVM process, they are not aware of any of the semantics of the Java language or how they relate to the actual layout of memory. So, for example, the programmer has no way of examining variables on the stack, since the location of the Java stack is not exposed to gdb.
At the same time, debuggers for the Java programming language are also unhelpful. This is because JikesRVM is actually written in a variant of Java that has been extended with low-level primitive types and operations, such as Addresses (basically pointers). A debugger for Java has no idea how to make sense of these.
Since it is very difficult to use a debugger or create auxiliary structures for debugging, this leaves two primary methods: copious amounts of logging, and copious amounts of assertion checking. Both techniques have similar drawbacks. Logging a lot of information dramatically slows down program execution, and the programmer will often discover, after a long run of the program, that they need to log just one more piece of information to make sense of the output; at which point the entire run of the program must be repeated.
4.4.1 Sanity Checker
JikesRVM includes a sanity checker for garbage collectors. It is essentially a mark-sweep collector implemented as simply as possible, with no performance optimizations (it is, for example, single-threaded). The sanity checker does not actually reclaim any memory; rather, it runs before the garbage collector and records which objects it found to be reachable. Then, after the garbage collector runs, its results are compared against the sanity checker's. Since the sanity checker is simple, it is assumed to be correct.
As the sanity checker already traverses the heap to check invariants, it is a good place to add assertions checking additional invariants. For example, in the deferred collector we want to check the invariant that all objects reachable from a deferred object are themselves deferred. This is testable (although slowly) using simple modifications to the sanity checker.
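A minimal sketch of this invariant check, assuming a toy object graph in place of the sanity checker's real heap traversal:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of the extra invariant added to the sanity checker's heap
// traversal: every object reachable from a deferred object must
// itself be deferred. Objects are modeled as nodes with a deferred
// flag; the real checker walks actual heap references.
public class DeferredInvariantCheck {
    static class Node {
        boolean deferred;
        List<Node> refs = new ArrayList<>();
        Node(boolean deferred) { this.deferred = deferred; }
    }

    // Returns true if the invariant holds for everything reachable
    // from the given deferred root.
    static boolean invariantHolds(Node deferredRoot) {
        Deque<Node> work = new ArrayDeque<>();
        Set<Node> seen = new HashSet<>();
        work.push(deferredRoot);
        while (!work.isEmpty()) {
            Node n = work.pop();
            if (!seen.add(n)) continue; // already visited
            if (!n.deferred) return false; // invariant violated
            work.addAll(n.refs);
        }
        return true;
    }
}
```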
There are two major downsides to using the sanity checker in this way. For one, it is very slow: it is a single-threaded collector, and must traverse the entire heap even on a nursery collection. It essentially adds an extra collection on top of each collection. The exact slowdown varies considerably depending on the parameters given to the collector, as well as the program being run. Using a 100 MB heap with a 4 MB nursery, on the generational mark-sweep collector, we observed a geometric mean slowdown of 14x on the DaCapo benchmarks.
Second, it can only check assertions at the time it is run. It gives little information about when specifically the assertion was violated, or what code might have caused the violation. This usually means combining it with trace-based debugging techniques: acquiring some object of interest from the sanity checker, and working back through the trace to understand the history of that object. Combining the sanity checker, potentially expensive assertions, and tracing results in an execution that is very slow. The exact slowdown depends on what information one wants out of the trace, and how much engineering effort one is willing to put into speeding up the process. We typically observed slowdowns of several hundred times for the most intensive debugging traces.
4.5 Experimental Results for the Deferred Collector
First we present results on a synthetic benchmark, and then a real benchmark from the DaCapo suite.
4.5.1 Methodology
We observed the mark/cons ratio (the ratio between the number of marking operations performed and the number of allocations) as we increased the period of immediate collections. For the purposes of this experiment, we count setting the deferred bit as a "mark", since marking an object and deferring an object require the same amount of work (setting one bit in the object header). An immediate collection period of 1 means that every collection is immediate (effectively the same as an unmodified generational garbage collector). An immediate collection period greater than the number of garbage collections means that all collections are deferred.
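The tally behind the mark/cons ratio can be sketched as follows; the counter class is illustrative, not the instrumentation actually inside the collector.

```java
// Sketch of how the mark/cons ratio is tallied in the experiments:
// setting the deferred bit counts as a mark, since both cost one bit
// write in the object header. Counter names are illustrative.
public class MarkConsCounter {
    long marks;        // objects marked during tracing
    long deferredSets; // deferred bits set (counted as marks)
    long allocations;  // "cons" operations

    void recordMark()        { marks++; }
    void recordDeferredSet() { deferredSets++; }
    void recordAllocation()  { allocations++; }

    double markConsRatio() {
        return (double) (marks + deferredSets) / allocations;
    }
}
```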
JikesRVM can alter the size of the nursery at run time. In order to avoid any interference from the nursery resizing algorithm, we fixed the nursery size at 4 MB.
Since the amount of allocation and marking is not deterministic in the DaCapo benchmarks, we used five runs of each configuration, and took the best of the five runs for each value of immediate collection frequency.
4.5.2 Doubly Linked List
Our doubly linked list program is a simple example that illustrates how the deferred collector can be used. It allocates a long doubly linked list (100,000 nodes), which lives for the entire lifetime of the program. The program then churns through memory by allocating equally sized linked lists and disconnecting them from the roots, ensuring they will be collected when the garbage collector runs (that is, they have shorter lifetimes). The key object used is simply the head of the long-lived list. This is a good case for the deferred collector, since it is able to save the effort of repeatedly tracing the long-lived list.
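The structure of the benchmark can be sketched as below; the list type, iteration count, and the stubbed-out hint call are illustrative rather than the program's actual source.

```java
import java.util.LinkedList;

// Sketch of the doubly linked list benchmark: one long-lived list is
// kept reachable for the whole run, while short-lived lists of the
// same size are repeatedly allocated and dropped so the collector can
// reclaim them.
public class ListChurnBenchmark {
    // Stub for the deferred collector hint; the real call marks the
    // list head as a key object.
    static void traceInfrequently(Object keyObject) { }

    static LinkedList<Integer> buildList(int n) {
        LinkedList<Integer> list = new LinkedList<>();
        for (int i = 0; i < n; i++) list.add(i);
        return list;
    }

    public static void main(String[] args) {
        final int SIZE = 100_000;
        LinkedList<Integer> longLived = buildList(SIZE);
        traceInfrequently(longLived); // hint: the long-lived list head
        for (int round = 0; round < 10; round++) {
            LinkedList<Integer> shortLived = buildList(SIZE);
            // falls out of scope here: disconnected from the roots,
            // so it dies at the next collection
        }
    }
}
```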
We see the resulting mark/cons ratio in Figure 4.2. When all collections are immediate (i.e., the immediate collection period is one), the mark/cons ratio is 0.053. It drops to 0.026 with an immediate collection period of 3. At the maximum meaningful immediate collection period of seven, the mark/cons ratio is 0.019. This indicates that the deferred collector eliminates about half of the marking work, which is the expected result.
Figure 4.2: mark/cons ratio for the doubly linked list program, run with a 16 MB heap, and differing values of immediate collection period.
4.5.3 sunflow
Sunflow is a ray tracer that is part of the DaCapo suite of benchmarks. In order to find the key objects, we ran sunflow through Elephant Tracks, analyzed the resulting traces, and found the allocation contexts of potential candidates. This is somewhat complicated by the fact that Elephant Tracks does not run on JikesRVM, the VM on which the deferred collector is implemented. Some VM-specific features actually show up in the Java heap (such as choices in the class libraries). This means that finding the key objects cannot be entirely automated, and requires some manual inspection of code to determine the appropriate objects and points in the code.
After analyzing the traces, we found four static source code locations at which to give hints to the deferred collector:
1. In the constructor of TriangleMesh, we send the TriangleMesh itself to the
deferred collector. The TriangleMesh is used to represent the scene being
traced.
2. In the method render of BucketThread, we invoke traceInfrequently on
each of several thread objects. These threads do the rendering.
3. In the constructor of RenderObjectMap, a hash map that maps objects in the scene to a data structure containing information about their rendering.
4. In the constructor of Sunflow, a class that maintains some data structures for managing the benchmark harness itself.
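The first of these hint sites might look like the sketch below; the class body is a stand-in for sunflow's actual TriangleMesh, with the collector call stubbed out.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of hint site 1 above: the constructor passes the newly built
// object to the collector as a key object. The fields and the stub are
// illustrative; only the hint placement matches the text.
public class TriangleMesh {
    static final List<Object> hinted = new ArrayList<>();

    // Stand-in for the deferred collector's hint entry point.
    static void traceInfrequently(Object o) { hinted.add(o); }

    TriangleMesh() {
        // ... build mesh data representing the scene being traced ...
        traceInfrequently(this); // hint: the whole mesh is long-lived
    }
}
```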
We found that sunflow requires a minimum heap size of 21 MB to run to completion on JikesRVM, and performed experiments with two heap size configurations: 1.25 times this minimum size and 1.5 times the minimum size (27 MB and 32 MB, respectively). At the 1.25x heap size, the virtual machine performs only eight full-heap garbage collections, meaning beyond this there is not much garbage collector work to be saved. In the 1.5x configuration, the virtual machine runs the garbage collector only three times.
Figure 4.3 shows the resulting mark/cons ratio for the 1.25x (27 MB heap) configuration with varying values of the immediate collection period. With an immediate collection period of 1 (meaning all collections are immediate), the mark/cons ratio is 0.031. With an immediate collection period of 3 (meaning every third collection is immediate), the mark/cons ratio decreases to 0.029 (a 6% decrease). Since only eight full-heap garbage collections are performed, when the immediate collection period is greater than eight, all collections are deferred; this results in a mark/cons ratio of 0.025 (a 20% reduction in mark/cons).
Figure 4.3: mark/cons ratio for sunflow, run with a 27 MB heap, and differing values of immediate collection period.
Figure 4.4 shows that as the immediate collection period increases, the mark/cons ratio decreases. As there are fewer garbage collections in this configuration, there is less opportunity to save work. There is a decline from 0.012 to 0.011 when the deferred collector is used with an immediate collection every three collections (a 6% decline in mark/cons). Since only three full-heap garbage collections are performed, when the immediate collection period is four or greater, all collections are deferred.
Figure 4.4: mark/cons ratio for sunflow, run with a 32 MB heap, and differing values of immediate collection period.
4.6 Related Work
There are two pieces of closely related work. The first is Cohen and Petrank's "Data Structure Aware Garbage Collector" (Cohen and Petrank, 2015). This scheme relies on the programmer annotating the methods of specific data structures to clarify which methods add nodes to a data structure and which remove them. The nodes can then be marked in a single pass by copying from one side table to the mark side table, without traversal. This can give good performance, but is not profitable if removal is too frequent. So, if some instances of a data structure require frequent removal and others do not, two copies of that data structure's source code may be required. The data structure aware garbage collector was able to reduce running time by 31% on the hsqldb benchmark, from the earlier DaCapo 2006 suite.
The second is the Clustered Collector (Cutler and Morris, 2015). The Clustered Collector operates in a way similar to the Deferred Collector. In this scheme, the garbage collector identifies clusters of objects using a heuristic, and tries to avoid tracing within those clusters as long as a certain head object is still reachable. If any mutations occur within a cluster, it is traced normally. The Clustered Collector was implemented on a Scheme system that runs a completely different set of programs than the JVM on which the Deferred Collector is implemented, making a direct empirical comparison difficult. The only benchmark evaluated was the "Hacker News" application, a web content management system. The Clustered Collector was able to decrease garbage collector pause times by one-third, but reduced garbage collector throughput by 10% on this application.
Somewhat less closely related, Hirzel et al.'s connectivity-based garbage collection (Hirzel et al., 2003) organizes objects into separate spaces based on their connectivity, and tries to collect these spaces independently (similar to the way the nursery is collected in a generational collector). It employs heuristics based on the number of roots pointing into a given region, and attempts to collect the most profitable regions first.
Detlefs et al.'s "Garbage-First Collector" (Detlefs et al., 2004) similarly breaks the heap down into regions. However, it has a concurrent mark phase (that is, the mark phase runs concurrently with the mutator). Although marking is concurrent, the garbage-first collector does pause the mutator to collect a region, copying marked objects elsewhere. The goal of its region selection is to achieve a user-defined maximum pause time. They reported attaining specified soft real-time goals over 98% of the time in many applications.
Chapter 5
Conclusion
In this thesis, we have presented a tool, Elephant Tracks, that lets us gather a great deal of information about the JVM, in particular the way it uses memory. We have shown how this tool could be used to build a garbage collector, the deferred garbage collector.
Elephant Tracks runs on any JVM that supports the JVMTI interface, is able to trace real programs, and provides much more information than any similar tool, at a cost not much greater.
The deferred collector shows one possible way garbage collection could be changed based on object lifetime information. While there are many possibilities that could be explored, it represents one way to give the programmer some control over the collector without sacrificing memory safety.
Bibliography
Agesen, O., Detlefs, D., and Moss, J. E. B. (1998). Garbage collection and local variable type-precision and liveness in Java virtual machines. In PLDI, pages 269–279.
Alpern, B., Augart, S., Blackburn, S. M., Butrico, M., Cocchi, A., Cheng, P., Dolby, J., Fink, S., Grove, D., Hind, M., et al. (2005a). The Jikes Research Virtual Machine project: Building an open-source research community. IBM Systems Journal, 44(2):399–417.
Alpern, B., Augart, S., Blackburn, S. M., Butrico, M. A., Cocchi, A., Cheng, P., Dolby, J., Fink, S. J., Grove, D., Hind, M., McKinley, K. S., Mergen, M. F., Moss, J. E. B., Ngo, T. A., Sarkar, V., and Trapp, M. (2005b). The Jikes Research Virtual Machine project: Building an open-source research community. IBM Systems Journal, 44(2):399–418.
Baker, H. G. (1992). The treadmill: Real-time garbage collection without motion sickness. SIGPLAN Not., 27(3):66–70.
Blackburn, S., Garner, R., McKinley, K. S., Diwan, A., Guyer, S. Z., Hosking, A., Moss, J. E. B., Stefanović, D., et al. (2006). The DaCapo benchmarks: Java benchmarking development and analysis. In ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, ACM SIGPLAN Notices 41(10), pages 169–190, Portland, OR. ACM Press.
Bruneton, E., Langlet, R., and Coupaye, T. (2002). ASM: A code manipulation tool to implement adaptable systems. In Adaptable and Extensible Component Systems, Grenoble, France. 12 pages.
Cohen, N. and Petrank, E. (2015). Data structure aware garbage collector.
Cutler, C. and Morris, R. (2015). Reducing pause times with clustered collection. In Proceedings of the 2015 ACM SIGPLAN International Symposium on Memory Management, ISMM 2015, pages 131–142, New York, NY, USA. ACM.
Detlefs, D., Flood, C., Heller, S., and Printezis, T. (2004). Garbage-first garbage collection. In Proceedings of the 4th International Symposium on Memory Management, ISMM ’04, pages 37–48, New York, NY, USA. ACM.
Foucar, J. (2006). A Platform for Research into Object-Level Trace Generation. PhD thesis, The University of New Mexico.
Guyer, S. Z., McKinley, K. S., and Frampton, D. (2006). Free-Me: A static analy- sis for automatic individual object reclamation. PLDI, ACM SIGPLAN Notices, 41(6):364–375.
Hayes, B. (1991). Using key object opportunism to collect old objects. In Conference Proceedings on Object-oriented Programming Systems, Languages, and Applications, OOPSLA ’91, pages 33–46, New York, NY, USA. ACM.
Hertz, M. and Berger, E. D. (2004). Automatic vs. explicit memory management: Settling the performance debate.
Hertz, M., Blackburn, S. M., Moss, J. E. B., McKinley, K. S., and Stefanović, D. (2002). Error-free garbage collection traces: How to cheat and not get caught. SIGMETRICS Perform. Eval. Rev., 30:140–151.
Hertz, M., Blackburn, S. M., Moss, J. E. B., McKinley, K. S., and Stefanović, D. (2006). Generating object lifetime traces with Merlin. ACM Transactions on Programming Languages and Systems, 28(3):476–516.
Hirzel, M., Diwan, A., and Henkel, J. (2002). On the usefulness of type and liveness accuracy for garbage collection and leak detection. ACM Transactions on Programming Languages and Systems (TOPLAS), 24(6):593–624.
Hirzel, M., Diwan, A., and Hertz, M. (2003). Connectivity-based garbage collection. In ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, pages 359–373.
Inoue, H., Stefanović, D., and Forrest, S. (2006). On the prediction of Java object lifetimes. IEEE Transactions on Computers, 55(7):880–892.
Jones, R. E. and Ryder, C. (2008). A study of Java object demographics. In Proceedings of the 7th International Symposium on Memory Management, pages 121–130. ACM.
Lambert, J. M. and Power, J. F. (2008). Platform independent timing of Java virtual machine bytecode instructions. Electronic Notes in Theoretical Computer Science, 220(3):97–113.
Röjemo, N. and Runciman, C. (1996). Lag, drag, void and use—heap profiling and space-efficient compilation revisited. In Proc. Intl. Conf. on Functional Programming. SIGPLAN Not., 31(6):34–41.
Sun Microsystems (2004). JVM Tool Interface. http://java.sun.com/javase/6/docs/platform/jvmti/jvmti.html.
Uhlig, R. A. and Mudge, T. N. (1997). Trace-driven memory simulation: A survey. ACM Computing Surveys (CSUR), 29(2):128–170.
Xu, G. (2013). Resurrector: A tunable object lifetime profiling technique for opti- mizing real-world programs. In OOPSLA’13, volume 48, pages 111–130. ACM.