TARDIS: Affordable Time-Travel Debugging in Managed Runtimes
Total Page:16
File Type:pdf, Size:1020Kb
TARDIS: Affordable Time-Travel Debugging in Managed Runtimes Earl T. Barr Mark Marron University College London Microsoft Research [email protected] [email protected] Abstract 1. Introduction Developers who set a breakpoint a few statements too late or Developers spend a great deal of time debugging. Skilled de- who are trying to diagnose a subtle bug from a single core velopers apply the scientific method: they observe a buggy dump often wish for a time-traveling debugger. The ability program’s behavior on different inputs, first to reproduce the to rewind time to see the exact sequence of statements and bug, then to localize the bug in the program’s source. Re- program values leading to an error has great intuitive appeal producing, localizing, then fixing a bug involves forming but, due to large time and space overheads, time-traveling and validating hypotheses about the bug’s root cause, the debuggers have seen limited adoption. first point at which the program’s state diverges from the in- A managed runtime, such as the Java JVM or a JavaScript tended state. To validate a hypothesis, a developer must halt engine, has already paid much of the cost of providing core program execution and examine its state. Especially early in features — type safety, memory management, and virtual debugging, when the developer’s quarry is still elusive, she IO — that can be reused to implement a low overhead time- will often halt execution too soon or too late to validate her traveling debugger. We leverage this insight to design and current hypothesis. Overshooting is expensive because re- build affordable time-traveling debuggers for managed lan- turning to a earlier point in a program’s execution requires guages. TARDIS realizes our design: it provides affordable re-running the program. time-travel with an average overhead of only 7% during nor- Although modern integrated development environments mal execution, a rate of 0:6 MB/s of history logging, and (IDEs) provide a GUI for halting program execution and ex- a worst-case 0:68s time-travel latency on our benchmark amining program state, tool support for debugging has not applications. TARDIS can also debug optimized code using changed in decades; in particular, the cost of program restart time-travel to reconstruct state. This capability, coupled with when placing a breakpoint too late in a program’s execu- its low overhead, makes TARDIS suitable for use as the de- tion remains. A “time-traveling” debugger (TTD) that sup- fault debugger for managed languages, promising to bring ports reverse execution can speed the hypothesis formation time-traveling debugging into the mainstream and transform and validation loop by allowing developers to navigate back- the practice of debugging. wards, as well as forwards, in a program’s execution history. Additionally, the ability to perform efficient forward and re- Categories and Subject Descriptors D.2.5 [Testing and verse execution serves as an enabling technology for other Debugging]: Debugging aids debugging tools, like interrogative debugging [31, 36] or au- Keywords Time-Traveling Debugger, Managed Runtimes tomatic root cause analysis [12, 22, 27, 29]. Despite the intuitive appeal of a time-traveling debugger1, “Debugging involves backwards reasoning” and many research [4, 20, 30, 36, 54] and industrial [15, 21, Brian Kernighan and Rob Pike 26, 47, 51] efforts to build TTD systems, they have not seen The Practice of Programming, 1999 widespread adoption for one of two reasons. The first issue is that TTD systems can impose prohibitive runtime over- heads on a debugger during forward execution: 10–100× Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed execution slowdown and the cost of writing multi-GB log for profit or commercial advantage and that copies bear this notice and the full citation files. The second issue is long pause times when initiating on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or reverse execution, the additional functionality TTD offers, republish, to post on servers or to redistribute to lists, requires prior specific permission which can take 10s of seconds. Usability research shows that and/or a fee. Request permissions from [email protected]. wait times of more than 10 seconds cause rapid system aban- OOPSLA ’14, October 20–24, 2014, Portland, OR, USA. Copyright is held by the owner/author(s). Publication rights licensed to ACM. ACM 978-1-4503-2585-1/14/10. $15.00. 1 Time-traveling debuggers have been variously called omniscient, trace- http://dx.doi.org/10.1145/2660193.2660209 based, bi-directional, or reversible. donment and that wait times of more than a few seconds the underlying OS APIs, can be made. Thus, by construction, frustrate users and trigger feature avoidance [40]. Grounded managed runtimes provide a small instrumentation surface on these results, we define an affordable time-traveling de- for adding functionality to record/replay non-deterministic bugger as one whose execution overhead is under 25% and environmental interactions, including file system, console, whose time-travel latency is under 1 second. and network I/O. Section 3.4 describes how TARDIS cap- We present TARDIS, the first TTD solution to achieve tures these interactions using specialized library implemen- both of these goals. On our benchmarks, forward debugging tations. These implementations examine program state as overhead averages 7% and time-travel latency is sub-second, well as memory layout to make decisions such as storing at 0:68s, as detailed2 in Section 5. In short, TARDIS works only a file offset to obviate seeking over a large immutable to realize Hegel’s adage that a sufficient quantitative change, file, or converting environmental writes or operations that in TTD performance, becomes qualitative change, in prac- overwrite data on which nothing depends into NOOPs dur- tical utility. TARDIS rests on the insight that managed run- ing replay. To ensure deterministic replay of multi-threaded times, such as the .Net CLR VM or a JavaScript engine, have programs, TARDIS, like other systems [30, 48, 51], multi- paid the cost of providing core features — type safety, mem- plexes all threads onto a single logical CPU and logs the ory management, and virtual IO — that can be reused to context switches and timer events. Section 3.3 explains how build an affordable TTD system. Thus, constructing TTD TARDIS uses efficient persistent [19] implementations of within a managed runtime imposes 1) minimal additional specialized data structures to virtualize its file system. This cost bounded by a small constant factor and 2) more fully section also shows how to leverage standard library file se- realizes the return on the investment managed runtimes have mantics in novel ways to optimize many common file I/O made in supporting these features. scenarios, such as avoiding the versioning of duplicate writes A standard approach for implementing a TTD system is and writes that are free of data dependencies. to take snapshots of a program’s state at regular intervals This paper makes the following contributions: and record all non-deterministic environmental interactions, such as console I/O or timer events, that occur between snap- • We present TARDIS, the first affordable time-traveling shots. Under this scheme, reverse execution first restores the debugger for managed languages, built on the insight that nearest snapshot that precedes a reverse execution target, a managed runtime has already paid the cost of providing then re-executes forward from that snapshot, replaying en- core features — type safety, memory management, and vironmental interactions with environmental writes rendered process virtualization — that can be reused to implement idempotent, to reach the target. a TTD system, improving the return on the investment Memory management facilities allow a system to traverse made in supporting these features. and manipulate its heap at runtime; type information allows • We describe a baseline, nonintrusive TTD design that a system to do this introspection precisely. Without these can retrofit existing systems (Section 3.1); on our bench- features, a TDD system must snapshot a process’s entire ad- marks, this design’s performance results — an average dress space (or its dirty pages), copying a variety of use- slowdown of 14% (max of 22%) and 2:2 MB/s (max of less data such as dead-memory, buffered data from disk, JIT 5:6 MB/s) of logging — apply to any managed runtime state, code, etc. [20, 30]. Managed runtimes provide both and demonstrate the generality of our approach. features. Section 3 shows how TARDIS exploits them to effi- ciently identify and snapshot only live objects and program • We present an optimized design of TARDIS in Section 3.2 state, thereby improving on the basic snapshot and record/re- that we tightly integrated into a .Net CLR runtime to play TDD design. Section 3.2 leverages insights from work leverage its features and (optionally) additional hardware on garbage-collection to further reduce the cost of produc- resources to reduce TARDIS’s average runtime overhead ing state snapshots. TARDIS is the first system, two decades to only 7% (max of 11%), record an average of 0:6 MB/s after it was initially suggested [54], to implement piggyback- (max of 1:3 MB/s) of data, and achieve time-travel latency ing of snapshots on the garbage collector. Beyond this work of less than 0:68s seconds on our benchmarks. sharing optimization, TARDIS also utilizes remembered-sets • We show how TARDIS can run heavily optimized code, from a generational collector to reduce the cost of walking so long as 4 lightweight conditions are met, by time- the live heap and a write-barrier to trap writes so it can snap- traveling to reconstruct source-level debugging informa- shot only modified memory locations.