Efficient Tracing of Cold Code via Bias-Free Sampling

Baris Kasikci, École Polytechnique Fédérale de Lausanne (EPFL); Thomas Ball, Microsoft; George Candea, École Polytechnique Fédérale de Lausanne (EPFL); John Erickson and Madanlal Musuvathi, Microsoft

https://www.usenix.org/conference/atc14/technical-sessions/presentation/kasikci

This paper is included in the Proceedings of USENIX ATC '14: 2014 USENIX Annual Technical Conference. June 19–20, 2014, Philadelphia, PA. ISBN 978-1-931971-10-2

Open access to the Proceedings of USENIX ATC '14: 2014 USENIX Annual Technical Conference is sponsored by USENIX.

Baris Kasikci¹, Thomas Ball², George Candea¹, John Erickson², and Madanlal Musuvathi²
¹School of Computer and Communication Sciences, EPFL   ²Microsoft

Abstract

Bugs often lurk in code that is infrequently executed (i.e., cold code), so testing and debugging requires tracing such code. Alas, the location of cold code is generally not known a priori and, by definition, cold code is elusive during execution. Thus, programs either incur unnecessary runtime overhead to "catch" cold code, or they must employ sampling, in which case many executions are required to sample the cold code even once.

We introduce a technique called bias-free sampling (BfS), in which the machine instructions of a dynamic execution are sampled independently of their execution frequency by using breakpoints. The BfS overhead is therefore independent of a program's runtime behavior and is fully predictable: it is merely a function of program size. BfS operates directly on binaries.

We present the theory and implementation of BfS for both managed and unmanaged code, as well as both kernel and user mode. We ran BfS on a total of 679 programs (all Windows system binaries, Z3, the SPECint suite, and several C# benchmarks), and BfS incurred performance overheads of just 1–6%.

(Author contacts: baris.kasikci@epfl.ch, [email protected], george.candea@epfl.ch, [email protected], [email protected])

1 Introduction

Monitoring a program's control flow is a fundamental way to gain insight into program behavior [5]. At one extreme, we can record a bit per basic block that measures whether or not a block executed over an entire execution (coverage) [29]. At the other extreme, we can record the dynamic sequence of basic blocks executed (tracing) [28]. In between these two extremes there is a wide range of monitoring strategies that trade off runtime overhead for precision. For example, record-replay systems [12, 15] that record most events in a program incur a large overhead, whereas sampling strategies that collect fewer runtime events for both profiling and tracing [16] incur less overhead.

In testing and debugging, there is a need to sample infrequently executed (i.e., cold) instructions at runtime, because bugs often lurk in cold code [9, 23]. However, we do not know a priori which basic blocks will be cold vs. hot at runtime, so we cannot instrument just the cold ones. To make matters worse, traditional temporal sampling techniques [21, 24] that trade off sampling rate for sampling coverage can miss cold instructions when the sampling rate is low, requiring many executions to gain acceptable coverage. As a result, developers do not have effective and efficient tools for sampling cold code.

In this paper, we present a non-temporal approach to sampling that we call bias-free sampling (BfS). BfS is guaranteed to sample cold instructions without over-sampling hot instructions, thereby reducing the overhead typically associated with temporal sampling. The basic idea is to sample any instruction of interest the next time it executes, without imposing any overhead on any other instructions in the program. We do this dynamically, using code breakpoints, a facility present in all modern CPUs. We created lightweight code breakpoint (LCB) monitors for both the kernel and user mode of Windows, for both native applications (with direct support in the kernel) and managed applications (with a user-space monitor), on both Intel and ARM architectures.

To ensure that none of the cold instructions are missed, the bias-free sampler inserts a breakpoint at every basic block in the program, both at the beginning of the program execution and periodically during the execution. This ensures at least one sample per period of every cold instruction. We also show how to sample hot instructions without bias, independently of their execution frequency, at a low rate.

Devising an efficient solution that works well in practice on a large set of programs requires solving multiple challenges: (a) processing a large number of breakpoints, in the worst case simultaneously on every instruction in the program (existing debugging frameworks are unable to handle such high volumes because their design is not optimized for a large number of breakpoints that must be processed quickly); (b) handling breakpoints correctly in the presence of managed code and JIT optimizations (managed code gets optimized during execution, so it cannot be handled the same way as native code); and (c) preserving the correct semantics of programs and associated services, such as debuggers.

A particular instance of LCB that we built is the lightweight code coverage (LCC) tool. We have successfully run LCC at scale to simultaneously measure code coverage on all processes and kernel drivers in a standard Windows 8 machine with imperceptible overheads. We have also extended LCC with the ability to record periodic code coverage logs. LCC is now being used internally at Microsoft to measure code coverage.

Using breakpoints overcomes many of the pitfalls of code instrumentation. CPU support for breakpoints allows (a) setting a breakpoint on any instruction, (b) setting an arbitrary number of breakpoints, and (c) setting or clearing a breakpoint without synchronizing with other threads that could potentially execute the same instruction (with the exception of managed code).

The contributions and organization of this paper are:

• We analyze and dissect common approaches to cold code monitoring, showing that there is need for improvement (§2);

• We present our BfS design (§3) and its efficient and comprehensive implementation using breakpoints for both the kernel and user mode of Windows, for both native and managed applications (§4);

• We show, on a total of 679 programs, that with our implementation of LCB, our coverage tool LCC, which places a breakpoint on every basic block in an executable and removes it when fired, has an overhead of 1–2% on a variety of native C benchmarks and an overhead of 1–6% on a variety of managed C# benchmarks (§5);

• We show how to use periodic BfS to extend LCC to quickly build interprocedural traces with overheads in the range of 3–6% (§6).

§7 discusses related work and §8 concludes with a discussion of applications for BfS.

2 From Rewriting to Bias-Free Sampling

In this section, we provide background on the approaches used to monitor program behavior and outline the conceptual path that leads to our proposed technique.

2.1 Program Rewriting

A traditional approach to monitoring program behavior is static program rewriting as done by [13], which takes as input an executable E and outputs a new executable that is functionally the same as E except that it also monitors the behavior of E. At Microsoft, many such monitoring tools have been built on top of the Vulcan binary rewriting framework [27], such as the code coverage tool bbcover. Vulcan provides a number of program abstractions, such as the program control-flow graph, and the tool user can leverage these abstractions and the Vulcan APIs to add instructions at specific points in the binary. Vulcan ensures that the branches of the program are adjusted to reflect this addition of code.

Another approach to monitoring is dynamic program rewriting, as done by DynInst [7] and Pin [22], as well as Microsoft's Nirvana and iDNA framework [6]. Many of the tools built with rewriting-based approaches, both static and dynamic, use "always-on" instrumentation (they keep the dynamically-added instrumentation until the program terminates), even for goals that should be much less demanding, like measuring code coverage.

2.2 Efficient Sampling

Static or dynamic program rewriting approaches that are always-on incur prohibitive overheads, and they cannot sample cold code in a bias-free manner.

In 2001, Arnold et al. introduced a framework for reducing the cost of instrumented code that combines instrumentation and counter-based sampling of loops [24]. In this approach, there are two copies of each procedure: the "counting" version of the procedure increments a counter on procedure entry and a counter for each loop back edge, ensuring that there is no unbounded portion of execution without some counter being incremented. When a user-specified limit is reached, control transfers from the counting version to a more heavily instrumented version of the procedure, which (after recording the sample) transfers control via the loop back edges back to the counting version. In this way, the technique can record more detailed information about acyclic intraprocedural paths on a periodic basis.

Hirzel et al. extended this method to reduce overhead further and to trace interprocedural paths [17]. They implemented "bursty tracing" using Vulcan, and report runtime overheads in the range of 3–18%. In further work [16] they sample code at a rate inversely proportional to its frequency, so that less frequently executed code is sampled more often. This approach is based on the premise that bugs reside mainly on cold paths.

Around the same time, Liblit et al. [21] proposed "Sampling the Bernoulli Way" in their paper on what

later was termed "cooperative bug isolation." The motivation for their approach was that classic sampling for measuring program performance "searches for the 'elephant in the haystack': it looks for the biggest consumers of time" [21]. In contrast, their goal is to look for needles (bugs) that may occur rarely, and the sampling rates may be very low to maintain client performance. This leads to the requirement that the sampling be statistically fair, so that the reported frequencies of rare events are reliable. The essence of their approach is to perform fair and uniform sampling from a dynamic sequence of events. To obtain sufficient samples of rare events, their approach relies on collecting a large number of executions.

2.3 Bias-Free Code Sampling

There is a fundamental tension between the desire to look for needles in a haystack (cold code) and the use of bursty tracing or Bernoulli sampling to achieve efficiency. Bursty tracing can trace cold code, but at a high cost; sampling is efficient, but it requires many runs before cold paths are sampled, and thus may incur a large overhead.

Consider the simple example of a hot loop containing an if-then-else statement where the else branch is very infrequently executed compared to the loop head, say once every million iterations of the loop. The desire to keep the sampling rate low for efficiency means it is unlikely that Bernoulli sampling or the bursty tracing approach will hit upon the one execution of the else branch in a million iterations.

Furthermore, we generally do not know a priori which code blocks will be cold during the execution of interest. Thus, we need a way to sample all code without letting the different execution frequencies of the different code blocks influence the runtime performance overhead. In other words, the sampling rate of a code block should be (mostly) independent of how often it is executed. We say "mostly" because there still is a dependency: the block must be executed at least once for it to be sampled.

The basic idea behind our approach is (using the example above) that placing a breakpoint on the first instruction in the else branch guarantees that we will sample the next (albeit rare) execution of the else branch, with no cost for the many loop iterations before that point. By refreshing this breakpoint periodically, we can obtain several samples of this rare event.

Looking at it from the other side, Bernoulli sampling gives equal likelihood that any of the million iterations of the loop's execution will be sampled. This may be fair to all the loop iterations, but it does not help identify the cold code. Cooperative bug isolation makes up for the fact that a single execution may not uncover cold code by relying on the law of large numbers (of executions) to increase the confidence that a rare event will be sampled. Bias-free code sampling is a way to sample cold events just as efficiently as Bernoulli sampling, with far fewer executions.

3 Design

The core idea of BfS is to use breakpoints to: (a) sample cold instructions that execute only a few times during an execution, without over-sampling hot instructions; and (b) sample the remaining instructions independently of their execution frequency. Algorithm 1 presents the BfS algorithm. We discuss the algorithm in its full generality before discussing particular instantiations.

3.1 Inputs

The algorithm takes three input parameters. The parameter K ensures that the first K executions of any instruction are always sampled. Assuming a nonzero K, this ensures that rare instructions, such as those in exception paths, are always sampled when executed.

The second parameter P is the sampling distribution over the instructions. For instance, a memory leak detector [16] might choose to sample only memory-access instructions, and accordingly P will assign zero probability to non-memory-accesses. Similarly, a data-race detector [18] might choose only memory accesses that are not statically provable as data-race-free. Among the instructions with non-zero probability, P might either dictate uniform sampling or bias towards some instructions, based on application needs. For instance, additional runtime profile information could be used to increase the bias towards hot instructions or towards instructions that are likely to be buggy.

The final parameter R determines the desired sampling rate, i.e., the number of samples generated per second. This indirectly determines the overhead of the algorithm. In the special case when R is infinity, the algorithm periodically refreshes breakpoints on all instructions selected according to P.

3.2 Cold Instruction Sampling

The algorithm maintains a map BPCount that determines the number of logical breakpoints set at a particular instruction. The algorithm ensures that a breakpoint is set at a particular instruction whenever its BPCount is greater than zero. When a breakpoint fires, this count is decremented, and the breakpoint is removed only when the count reaches zero. Setting all entries of this map to K ensures that the first K executions of the instructions with nonzero probability in P are sampled.
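This bookkeeping can be sketched in a few lines (a Python simulation with illustrative instruction addresses; the paper's implementation operates on real machine instructions, not a simulated execute loop):

```python
class BfsSampler:
    """Simulates the BPCount bookkeeping: every instruction of interest
    starts with K pending breakpoints, so its first K executions are
    always sampled, while hot code costs only K traps in total."""

    def __init__(self, pcs, k):
        self.bp_count = {pc: k for pc in pcs}      # BPCount[pc] > 0 => breakpoint set
        self.sample_count = {pc: 0 for pc in pcs}  # times pc was sampled
        self.samples = []                          # recorded samples, in order

    def execute(self, pc):
        """Model one execution of pc; it is sampled only if a breakpoint is set."""
        if self.bp_count.get(pc, 0) > 0:
            self.bp_count[pc] -= 1                 # one logical breakpoint consumed
            self.sample_count[pc] += 1
            self.samples.append(pc)

# A hot instruction (0x10) executes a million times; a cold one (0x20) twice.
s = BfsSampler(pcs=[0x10, 0x20], k=3)
for _ in range(1_000_000):
    s.execute(0x10)
s.execute(0x20)
s.execute(0x20)

assert s.sample_count[0x10] == 3   # hot code sampled only K times
assert s.sample_count[0x20] == 2   # every execution of the cold code sampled
```

Note how the cost is bounded by K per instruction, independent of how often the hot instruction runs, which is exactly the bias-free property.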

Algorithm 1: Bias-Free Sampling
    Input: int K, Dist P, int R

    // BPCount[pc] > 0 implies pc has a breakpoint
    Map<PC,int> BPCount
    Map<PC,int> SampleCount
    Set<PC>     FreqInst

    function Init()
        for all pc with nonzero probability in P:
            BPCount[pc] = K

    function OnBreakpoint(pc)
        BPCount[pc]--
        SampleCount[pc]++
        SampleInstruction(pc)
        if SampleCount[pc] >= K then
            FreqInst.Add(pc)
        if SampleCount[pc] > K then
            ChooseRandomInst()

    function Periodically()
        hitNum = NumBPInLastPeriod()
        if R is infinity then
            BPCount[pc]++ for all pc in FreqInst
            return
        while hitNum++ < R * Period do
            ChooseRandomInst()

    function ChooseRandomInst()
        pc = Choose(P, FreqInst)
        BPCount[pc]++
        FreqInst.Remove(pc)

At one extreme, with K = 1, P choosing only the first instruction in every basic block, and R = 0, we obtain an efficient mechanism for code coverage, described as LCC in Section 5. At the other extreme, when K is set to infinity, one gets full execution tracing.

The algorithm maintains FreqInst, a set of instructions that have executed K or more times. Periodically, the algorithm adds breakpoints to instructions selected from this set based on the distribution P. When one such breakpoint fires, another breakpoint is inserted on an instruction chosen from this set, again based on P. One can consider this as a single logical breakpoint moving from the sampled instruction to a new instruction. To maintain the number of pending logical breakpoints, the algorithm uses the BPCount map to distinguish the initial K breakpoints from new breakpoints.

3.3 Bias-Free Sampling

Perhaps surprisingly, the algorithm described above is sufficient to sample instructions from FreqInst based on P, irrespective of whether these instructions are hot or cold. Since an instruction is sampled only when a breakpoint fires, and these breakpoints are inserted based on P, we meet the desired sampling distribution [18].

However, a single breakpoint set at a cold instruction may take a long time to fire. This can arbitrarily reduce the sampling rate achieved by this logical breakpoint. In the worst case, a breakpoint set in dead code will reduce the sampling rate to zero.

The algorithm has several mechanisms to avoid this pitfall and maintain an acceptable sampling rate. First, the algorithm starts by setting K logical breakpoints at every instruction. This helps in identifying only those instructions that have executed a few times. In particular, dead code will not be added to FreqInst. Second, once a breakpoint is set at an instruction, the instruction is removed from FreqInst until the breakpoint fires (at which point it is added back to FreqInst). This mechanism automatically prunes cold instructions from the set so as to periodically replenish the number of logical breakpoints. This is similar to DataCollider [18]; however, rather than maintaining a constant number of pending logical breakpoints, our algorithm increases the number of logical breakpoints in every period that has a lower number of breakpoint firings than expected by the sampling rate. As these logical breakpoints get "stuck" on cold instructions, the continuous replenishing helps maintain the sampling rate.

4 Implementation

Now, we describe the implementation of LCB in detail. We start by reviewing hardware and operating system support for breakpoints.

4.1 Breakpoint Mechanism

Modern hardware contains a special breakpoint instruction that tells the processor to trap into the operating system. For instance, the x86 architecture provides the int 3 instruction for this purpose. To set a breakpoint on an instruction, one overwrites the instruction with the breakpoint instruction. The breakpoint instruction is no larger than other instructions in the ISA (in x86, the breakpoint instruction is a single byte), making it possible to set a breakpoint without overwriting other instructions in the binary. When a breakpoint fires, the operating system forwards the interrupt to the process, or to the debugger if one is attached. Processing the breakpoint involves removing the breakpoint by writing back the original instruction at the instruction pointer and resuming the program. The breakpoint instruction is designed so that setting and removing a breakpoint can be done atomically in the presence of other threads that might be executing the same instructions. For example, in architectures (such as ARM) that support two-byte breakpoint instructions, all instructions are always two-byte aligned.
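The set/clear mechanics can be illustrated with a small simulation (Python; a real implementation patches executable pages, whereas here a plain byte buffer stands in for a code page):

```python
INT3 = 0xCC  # x86 breakpoint opcode (one byte)

class Breakpoints:
    """Set/clear breakpoints by overwriting the first byte of an
    instruction, remembering the original byte for restoration."""

    def __init__(self, code):
        self.code = code          # bytearray standing in for a code page
        self.saved = {}           # pc -> original byte

    def set(self, pc):
        self.saved[pc] = self.code[pc]
        self.code[pc] = INT3      # a single-byte store, hence atomic on x86

    def clear(self, pc):
        self.code[pc] = self.saved.pop(pc)  # write back the original byte

# "Code" consisting of NOPs; set a breakpoint at pc = 2, then process its firing.
code = bytearray([0x90] * 8)      # 0x90 = x86 NOP
bp = Breakpoints(code)
bp.set(2)
assert code[2] == INT3            # the trap fires when execution reaches pc 2
bp.clear(2)                       # the handler restores the original instruction
assert code == bytearray([0x90] * 8)
```

Because setting a breakpoint is a single one-byte store, no synchronization with other threads executing the same page is needed, which is what makes it cheap to manage hundreds of thousands of breakpoints.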

4 246 2014 USENIX Annual Technical Conference USENIX Association 4.2 Kernel Support Sampled&instruc5on& with&a&breakpoint! One of the key goals of LCB is to provide a general ca- Resumed&instruc5on©&& In&thread

[…] instructions that refer to the instruction pointer. The instruction copy in the thread-local buffer is reclaimed by the thread when it ensures that its current instruction pointer and the return values in its stack trace do not point to the copy. For kernel-mode drivers, LCB allocates a processor-local buffer rather than a thread-local one. This buffer is shared by all contexts that execute on a particular processor, including interrupt handlers.

4.4 BfS for Managed Code

Supporting managed code in LCB (such as code written in .NET languages and encoded into the Common Intermediate Language (CIL)) required overcoming several challenges: integration with the Common Language Runtime (CLR) [2], making sure that the just-in-time (JIT) optimizations do not remove certain breakpoints, and finding and fixing issues in the CLR that prohibited setting a large number of breakpoints. In this section, we detail how we overcame these challenges.¹

Initially, we attempted to place breakpoints on every basic block without going through the CLR debugging APIs. However, this did not work: the CLR introspects the managed binary during JITing, and if it finds that the binary has been modified (in this case, to include a breakpoint per basic block), it throws an exception and causes the program to crash.

Consequently, we used the CLR debugging APIs to support managed programs in LCB. To do this, we implemented a special debugger within LCB that intercepts the load of each managed module when a program is run and places a breakpoint in each of the program's basic blocks. This debugger's core responsibility is to place breakpoints and track their firing. A program need not be launched using this debugger for LCB to be operational: LCB can be automatically attached to a program at load time.

The second challenge was that the CLR JIT optimizations were modifying programs by eliminating some basic blocks (e.g., through dead-code elimination) or by moving them around (e.g., through loop-invariant code motion), causing the correspondence between the removed breakpoints and source code to be lost. To overcome this challenge, we added an option to LCB to disable JIT optimizations and obtain a perfect correspondence between the breakpoints and the basic blocks. We are looking into recovering the lost correspondence through program analysis as part of future work, thereby not forcing users of LCB to disable JIT optimizations.

4.5 Transparent Breakpoint Processing

For a facility that is commonly used, such as breakpoints, one would not expect the use of breakpoints to change the semantics of programs. While this is generally true, we had to handle several corner cases in order to apply LCB to a large number of programs.

4.5.1 Code Page Permissions

Setting a breakpoint requires write permission to modify the code pages. However, for security purposes, all code pages in Windows are read-only. A straightforward approach is to change the permission to enable writes, set/clear the breakpoint, and then reset the permission to read-only. However, this leaves a window in which another (misbehaving) thread could potentially write to the code page. Under such conditions, the original program would have received an access-violation exception, while the same program running with LCB would not.

To avoid this, LCB creates an alternate virtual mapping to the same code page with write permissions and uses this mapping to set and clear breakpoints. This mapping is created at a random virtual address to reduce the chances of a wild memory access matching the address. The virtual mapping is cached to amortize the cost of creating the mapping across multiple breakpoints: due to code locality, breakpoints in the same page are likely to fire together.

When sampling kernel-mode drivers, LCB sometimes needs to process breakpoints at interrupt levels at which it is unable to call virtual-memory-related functions to create or tear down virtual mappings.
In such tions were modifying the programs by eliminating some scenarios, LCB uses the copy mechanism for dealing basic blocks (e.g., through dead-code elimination) or by with multi-shot breakpoints described above (§ 4.3.3) to moving them around (e.g., through loop-invariant code temporarily resume execution without removing a break- motion), causing the correspondence between the re- point. At the same time, LCB queues a deferred proce- moved breakpoints and source code to be lost. dure call that is later invoked at a lower interrupt level to To overcome this challenge, we added an option to remove the breakpoint. LCB to disable JIT optimizations and obtain perfect Finally, LCB does not set or clear breakpoints on code pages that are writable in order to avoid conflicts with 1 In the process of implementing LCB for managed programs, we self-generated code. discovered and fixed performance bottlenecks and bugs in the CLR. CLR debugging APIs had such issues, because they were not built to be used by a client such as LCB that places a breakpoint in each basic 4.5.2 Making Breakpoints Invisible to the Debugger block of a program. The first bug we fixed was a performance issue that caused threads to unnecessarily stall while LCB was removing a Many programs with LCB enabled run with a debugger breakpoint, due to an incorrect spinlock implementation. The second attached. As described above, LCB hides its breakpoints bug was a subtle correctness issue that occurred only when the number of breakpoints was above 10,000, and JIT optimizations were enabled. from the debugger by processing them before the debug- We also fixed this issue that was causing the CLR to crash. ger. However, debuggers need to read the code pages,

say in order to disassemble code to display to the user. LCB traps such read requests and provides an alternate view with all its breakpoints removed.

5 LCC Evaluation

In this section, we measure the cost of placing "one-shot" breakpoints on every basic block in an executable using LCB monitors in order to measure code coverage. The resulting code coverage tool is called LCC; it represents the leanest instance of LCB. We first perform a case study on the Z3 automated theorem prover [10] (§5.1), followed by a broader investigation of the SPEC CPU2006 integer benchmarks (§5.2), three managed benchmarks from the CLR performance benchmarks (§5.3), and a large-scale evaluation on Windows binaries (§5.4).

The code coverage evaluations were performed on an HP desktop with a 4-core Intel Xeon W3520 and 8 GB of RAM running Windows 8. In our study, we consider three configurations for each application: no code coverage (base), the application statically rewritten by the bbcover tool (bbcover), and the application breakpoint-instrumented by LCC (lcc). In order to make the comparison between the tools as fair as possible, we use the same basic blocks for LCC breakpoints as those identified by the Vulcan framework for the bbcover tool. We instruct LCC to insert a breakpoint at the address of the first instruction in each basic block. On the firing of a breakpoint, a bit (in a bitvector) is set to indicate that the basic block has been covered.

[Figure 2: Plot comparing the absolute run-times of coverage tools bbcover (triangles, upper line; fit y = 1.9018x + 1.3681, R² = 0.9684) and LCC (circles, lower line; fit y = 1.0061x + 0.11, R² = 0.9998) on the Z3 program (y-axis) against the base configuration (x-axis). Both times are in seconds.]

[Figure 3: Log-log plot comparing the overhead of bbcover (orange triangles) and LCC (blue circles), in seconds over base (y-axis), as a function of base (x-axis).]
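The per-block recording is deliberately cheap: one bit per basic block, set the first time the block's one-shot breakpoint fires. A sketch of that bitvector bookkeeping (Python; block indices are illustrative):

```python
class Coverage:
    """One bit per basic block; a block's one-shot breakpoint sets its bit."""

    def __init__(self, num_blocks):
        self.bits = bytearray((num_blocks + 7) // 8)   # rounded up to whole bytes

    def on_breakpoint(self, block_idx):
        self.bits[block_idx // 8] |= 1 << (block_idx % 8)  # mark block covered

    def covered(self, block_idx):
        return bool(self.bits[block_idx // 8] & (1 << (block_idx % 8)))

    def count(self):
        return sum(bin(b).count("1") for b in self.bits)

cov = Coverage(439_927)             # Z3's basic-block count, from the text
for blk in (0, 7, 8, 1000, 1000):   # breakpoints fire; duplicates are harmless
    cov.on_breakpoint(blk)

assert cov.count() == 4
assert cov.covered(1000) and not cov.covered(9)
```

Since each breakpoint is removed after it fires, the steady-state cost is the bit-set on first execution and nothing thereafter.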
5.1 Z3

Z3 is an automated theorem prover written in C++, consisting of 439,927 basic blocks (as measured by Vulcan). Z3 is computationally and memory intensive, having at its core a SAT solver, which solves an NP-complete problem. We run Z3 on a set of 66 input files that take Z3 anywhere from under 1 second to 150 seconds to process (with many points in between). Each file contains a logic formula over the theory of bit vectors (generated automatically by the SAGE tool [14]) that Z3 attempts to prove satisfiable or unsatisfiable. Z3 reads the input file, performs its computation, and outputs "sat" or "unsat". We test the 64-bit version of the Release build of Z3. For each test file, we run each configuration five times.²

²We validated that the output of Z3 is the same when run under each code coverage configuration as in the base run, and that the coverage computed by LCC is the same as that computed by bbcover.

We added timers to LCC to measure the cost of setting breakpoints, which comes to about 100 milliseconds to set all 439,927 breakpoints.

Figure 2 plots the absolute run time of each test t for the base configuration (the median of 5 runs) against each of the two code coverage configurations, and shows the best linear fit for each configuration. We see that the overhead for LCC is less than 1%, with much less perturbation than that of bbcover, whose overhead is around 90% and has outliers.

We would expect the overhead for LCC to be a small constant, independent of the running time of the base execution. In the log-log plot of Figure 3, the x-axis is the run time in seconds of the base configuration on a test t, while the y-axis represents the overhead (in seconds) of each of the code coverage configurations (over the base configuration) on the same test t. We see the expected linear relationship between the cost of code coverage and execution time for bbcover. The plot for LCC shows that its overhead appears to increase slightly with the base time, although it never exceeds 1.5 seconds.

Figure 4 shows the number of basic blocks (y-axis) covered as a function of run time (x-axis, log scale). The first thing to notice is that most of the tests cover somewhere between 17,000 and 29,000 basic blocks, a

small fraction of all the basic blocks in Z3. This is not a surprise, as the 66 tests were selected from a suite that exercises just a part of Z3 (the bit-vector theory and the SAT solver). The two tests that cover fewer than 21,000 blocks also have the shortest runtimes. Block coverage increases slightly as execution time increases, correlated with the observed increase in runtime overhead for LCC.

[Figure 4: Run-time of the base configuration (x-axis, in seconds on log scale) versus number of basic blocks covered, for each of the 66 tests.]

[Figure 5: Plot comparing absolute runtimes in seconds (y-axis) of bbcover (triangles), LCC (circles), and uninstrumented execution (squares) on the RayTracer program as a function of input size in pixels (x-axis).]

5.2 SPEC CPU2006 Integer Benchmarks

To understand the cost of code coverage on a wider set of programs, we integrated both bbcover and LCC into the SPEC CPU2006 monitoring facility and performed experiments on the SPEC CPU2006 integer benchmarks (except for 462.libquantum and 483.xalancbmk, which did not compile successfully on Windows 8).

Table 1 presents the results of the experiments, with one row per benchmark. We ran each benchmark for five iterations using base tuning. The second and third columns show the number of basic blocks in a benchmark and the number of tests for that benchmark (each iteration runs all tests once and sums the results). We call out the number of tests because each test is a separate execution of the benchmark, which starts collection of code coverage afresh. Thus, for example, the 403.gcc benchmark has 9 tests and so will result in setting breakpoints 9 times on all 198,719 blocks (for one iteration). The columns labeled base, lcc, and bbcover are the median times (in seconds) reported by the runspec script over the five runs for each configuration, along with the standard deviation. The overheads of the lcc and bbcover configurations relative to the base configuration are reported in the remaining two columns.

Benchmark       blocks  tests  base (s)  std.  lcc (s)  std.  bbcover (s)   std.  lcc ovh.  bbcover ovh.
400.perlbench    68224      3    473.71  0.98   481.98  1.33      1308.49  12.31     1.75%       176.22%
401.bzip2         6667      6    575.02  0.77   584.31  2.57      1108.96   5.73     1.62%        92.86%
403.gcc         198719      9    402.27  0.81   410.55  2.75       765.55   1.32     2.06%        90.31%
429.mcf           5363      1    366.49  0.66   373.00  5.50       434.93   0.99     1.78%        18.67%
445.gobmk        43714      5    530.79  0.74   541.47  0.72      1162.91   0.63     2.01%       119.09%
456.hmmer        15563      2    350.59  1.31   357.65  0.17       446.69   1.78     2.01%        27.41%
458.sjeng        10502      1    629.40  3.04   638.24  1.02      1496.96   3.06     1.40%       137.84%
464.h264ref      24189      3    604.54  0.74   613.95  0.93      1008.73   3.57     1.56%        66.86%
471.omnetpp      47069      1    342.99  0.64   350.47  0.12       641.45   1.97     2.18%        87.01%
473.astar         6534      1    439.59  0.77   446.95  0.59       670.12   4.81     1.67%        52.44%

Table 1: Results of running the coverage tools on the SPEC CPU2006 integer benchmarks. See text for details.

The overhead of bbcover ranges from a low of 18.67% (429.mcf) to a high of 176.22% (400.perlbench). In general, the slowdown varies quite a bit depending on the benchmark. Our experience with static instrumentation is that the number of frequently executed basic blocks in the executable is the main determinant of overhead. The overhead of LCC, on the other hand, ranges from 1.40% to 2.18%, showing that LCC achieves low overhead across a range of benchmarks, despite the high cost of breakpoints.

5.3 Managed Code

We evaluated LCB's managed code support using three programs used internally at Microsoft for CLR performance benchmarking: RayTracer, a program that performs ray tracing; BizTalk, a server application used in business process automation; and ClientSimulator, a web client simulation program. We measured the uninstrumented runtimes and the coverage measurement overheads for bbcover and LCC. All results are averages of five runs.

In Figure 5, we vary the size of the input object RayTracer processes from 100 to 600 pixels to see how the overhead changes with input size; the y-axis shows the absolute runtime. The runtime overhead of LCC is a steady 0.2 seconds, corresponding to a maximum of 6% overhead irrespective of input size, whereas the runtime overhead of bbcover is proportional to the runtime of RayTracer, with a maximum absolute time of 28 seconds and a maximum overhead factor of 3×.

Similar to the native Z3 binary, this experiment shows that for the managed RayTracer binary, LCC's overhead is less than that of bbcover and is independent of the program's runtime behavior.

For BizTalk and ClientSimulator, we used the standard workloads of the benchmarks. For BizTalk, LCC incurs 1.1% runtime overhead versus bbcover's 2.0%; for ClientSimulator, LCC incurs 5.8% runtime overhead versus bbcover's 34.7%.

RayTracer has several loops that execute many times; therefore, for this case, the runtime overhead of bbcover (which instruments the code) is two orders of magnitude more than LCC's. LCC also has lower overhead for BizTalk and ClientSimulator. We conclude that the managed code support of LCC is efficient.

5.4 Windows Native Binaries

To evaluate the robustness of LCB, we applied LCC to all the native binaries from an internal release of Windows 8. We integrated and ran LCC on a subset of system tests in the standard Windows test environment. The goal of this experiment was to check whether LCC can robustly handle a variety of binaries, including kernel-mode drivers that are loaded during operating system boot-up. Another goal was to ensure that LCC does not introduce test failures, either due to implementation bugs or due to the timing perturbation introduced by the firing of breakpoints.

The system tests ran for a total of 4 hours on 17 machines. We repeated the test for different system builds: 32-bit and 64-bit x86 binaries, and ARM binaries. The size of the binaries covered ranges from 70 basic blocks to ~1,000,000. All tests completed successfully, with no spurious failures or performance regressions.

To compare coverage, we ran the same tests with the bbcover tool. Figure 6 shows the difference in coverage achieved by the two tools. Of the 665 binaries, bbcover did not produce coverage for 45 binaries because its over- […] or excessive runtime, run to completion with LCC. For a small number of cases, LCC reports less coverage than bbcover due to test non-determinism.

[Figure 6: Difference between the coverage reported by LCC vs. bbcover (y-axis) for 620 Windows binaries (x-axis).]

6 Cold Block Tracing

In this section, we extend LCC to create a simple tracing/logging facility for cold basic blocks, using two different strategies.
First, we store the order in which break- head caused those tests to time out, thereby failing them. points fire in a log. This reflects a compressed form of Therefore, the figure reports the coverage for the remain- an execution trace where all instances of a basic block ing 620 binaries. The binaries in the x-axis are ordered beyond its first execution are dropped. The size of this by the coverage achieved with bbcover. log is bounded by the size of the program. We call this As the tests are highly timing dependent and involve “single-shot logging”, since a basic block identifier will several boot-reboot cycles, there can be up to 20% dif- appear at most once in the log. Second, we set the R ference in coverage across runs. Despite this nondeter- parameter to infinity in the BfS algorithm (§3), to pe- minism, Figure 6 shows a clear trend. For all but 40 bi- riodically refresh breakpoints on all basic blocks. With naries, LCC reports more coverage than bbcover. This this option enabled, the size of the log file is proportional increased coverage is due to the fact that tests that time to the length of program execution rather than program out or fail under bbcover, due to problems in relocation size. Next, we discuss these two strategies in detail.
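The control flow of the two logging strategies above can be illustrated with a small simulation. This is only a sketch: LCB's actual mechanism patches breakpoint instructions into the running binary, whereas the class and block trace below are hypothetical stand-ins.

```python
import time

class ColdBlockLogger:
    """Simulates BfS cold-block logging: a 'breakpoint' on every basic
    block fires once, logs the block id, and is removed; an optional
    refresh period re-arms all breakpoints (periodic logging)."""

    def __init__(self, all_blocks, refresh_period=None):
        self.all_blocks = set(all_blocks)
        self.armed = set(all_blocks)          # blocks with a live breakpoint
        self.refresh_period = refresh_period  # None => single-shot logging
        self.last_refresh = time.monotonic()
        self.log = []                         # ordered block-id log

    def execute_block(self, block_id):
        # Periodic strategy: re-arm every breakpoint each refresh period,
        # so cold blocks keep being recorded over the whole execution.
        if (self.refresh_period is not None and
                time.monotonic() - self.last_refresh >= self.refresh_period):
            self.armed = set(self.all_blocks)
            self.last_refresh = time.monotonic()
        # A breakpoint fires only on an armed (not-yet-seen) block.
        if block_id in self.armed:
            self.log.append(block_id)
            self.armed.discard(block_id)      # one-shot: remove after firing

# Single-shot logging: hot blocks cost nothing after their first
# execution, and the log is bounded by program size.
logger = ColdBlockLogger(all_blocks=range(5))
for b in [0, 1, 1, 1, 2, 1, 0, 3]:
    logger.execute_block(b)
print(logger.log)  # -> [0, 1, 2, 3]: each block id appears at most once
```

With `refresh_period` set, the same trace would grow the log in proportion to execution length, mirroring the trade-off between the lcc and per0.5/per5.0 configurations evaluated below.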

Test #   base (sec.)  lcc (sec.)  per0.5 (sec.)  per0.5 ovh.  per5.0 (sec.)  per5.0 ovh.  lcc blocks  per0.5 blocks  Growth  per5.0 blocks  Growth
21         31.46        32.94        36.29        15.33%        33.10          5.20%        28731        85459        2.97×     46356       1.61×
52         31.41        31.47        35.55        13.17%        32.72          4.15%        26506        72886        2.75×     42359       1.60×
12         46.95        47.38        54.39        15.84%        49.02          4.40%        25472       113730        4.46×     45437       1.78×
57         48.07        48.14        54.76        13.91%        49.69          3.37%        25525       114658        4.49×     45882       1.80×
43         57.22        56.89        65.44        14.36%        60.13          5.07%        27264       129836        4.76×     51316       1.88×
65         61.53        61.19        70.74        14.96%        63.90          3.84%        25175       138819        5.51×     48581       1.93×
14         70.74        71.63        81.27        14.88%        74.22          4.91%        25690       149329        5.81×     59058       2.30×
62         89.14        89.31       101.96        14.38%        94.07          5.54%        28191       185079        6.57×     67408       2.39×
29        155.09       156.63       175.82        13.37%       164.88          6.31%        25522       269179       10.55×     72864       2.85×

Table 2: Periodic logging of Z3 on tests that execute 30 seconds or more.
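The Growth columns in Table 2 are simply the ratio of the periodic-log size to the one-shot (lcc) log size. As a quick check, for test 21 (values taken from the table):

```python
# Log-size growth for test 21 in Table 2: periodic logging records a
# block each time its (refreshed) breakpoint fires, so the log grows
# relative to the one-shot log, which is bounded by program size.
lcc_blocks   = 28731   # one-shot log entries for test 21
per05_blocks = 85459   # entries with breakpoints refreshed every 0.5 s
per50_blocks = 46356   # entries with breakpoints refreshed every 5 s

print(round(per05_blocks / lcc_blocks, 2))  # -> 2.97 (Growth column)
print(round(per50_blocks / lcc_blocks, 2))  # -> 1.61
```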

Figure 7: Scatterplot of the coverage log for Z3 test 15 with single-shot code coverage. The base execution of Z3 on test 15 took 20 seconds.

6.1 Single-Shot Code Coverage Logs

The additional cost to log basic block identifiers to a file is negligible for long executions and can be ameliorated by writing the log to a memory buffer, which in the case of single-shot logging is bounded by the size of the program. We give details on the overhead of logging when we consider periodic logging.

We can view a code coverage log as a sequence of events (i, b(i)), where i is the index of the event in the log and b(i) is the block id. Such information about the relative ordering of the first execution of each basic block can be useful for identifying phases in a program's execution. Each basic block b has associated symbol information, including an identifier of the function f(b) in which it resides. We assign to each function f a count c(f), which corresponds to the number of unique functions that appear before it in the log.

Figure 7 shows, for a single execution of Z3, a scatter plot that contains one point for each log entry (executed basic block) with index i (x-axis), where the y-axis is the count of the function containing the block, namely c(f(b(i))). The scatterplot shows there are about 4,500 events in the log. Phases are identified naturally by the pattern of "lower triangles", in which the blocks of a set of functions execute close together temporally. In the plot of the Z3 execution, we have highlighted six phases: (1) initialization of basic Z3 data structures and parsing of the input formula; (2) initialization of Z3's tactics, which determine the decision procedures it will use to solve the given formula; (3) general rewriting and simplification; (4) bit blasting of the bit-vector formula to propositional logic; (5) the SAT solver; and (6) output of information and freeing of data structures.

This simple analysis shows that a one-shot log can be used to naturally identify sets of related functions, since it provides an interprocedural view of control-flow behavior. We intend to use this information to identify program portions with performance bottlenecks and to improve job scheduling inside a datacenter [26, 11].

6.2 Periodic Logs

While one-shot code coverage logs are cheap to collect, there are many other (cold) traces the program will execute that will not be observed with the one-shot approach. To collect such information, we can periodically refresh the breakpoints on all basic blocks, as supported by the LCB framework.

Figure 8 shows the scatterplot of the execution log of Z3 run on the same test as in Figure 7, but with breakpoints refreshed every half second. From this plot, we can see that the SAT solver accounts for most of the log. Furthermore, compared to the one-shot log in Figure 7, we see the interplay between the code of the SAT solver in phase 5 and the code of functions executed early on (during phase 1), which represent various commonly used data structures. We also observe more activity between the functions in the final phase and the rest of the code (except for the SAT solver).
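The c(f(b(i))) transformation behind the scatterplots in Figures 7 and 8 is easy to reproduce offline from a coverage log. A minimal sketch, assuming a log of block ids and a block-to-function symbol map (the function and data below are toy examples, not Z3's actual symbols):

```python
def function_counts(log, func_of_block):
    """Map each log entry's block to c(f(b(i))): the number of distinct
    functions that appear in the log strictly before function f."""
    first_seen = {}   # function -> its c(f) value
    points = []       # (i, c(f(b(i)))) scatterplot points
    for i, block in enumerate(log):
        f = func_of_block[block]
        if f not in first_seen:
            first_seen[f] = len(first_seen)  # unique functions seen so far
        points.append((i, first_seen[f]))
    return points

# Toy log: blocks of functions "init" and "solve" executing in phases.
func_of_block = {0: "init", 1: "init", 2: "solve", 3: "solve"}
log = [0, 1, 0, 2, 3, 2, 1]
print(function_counts(log, func_of_block))
# -> [(0, 0), (1, 0), (2, 0), (3, 1), (4, 1), (5, 1), (6, 0)]
```

Plotting the resulting (i, c(f(b(i)))) points yields the "lower triangle" phase pattern described above: blocks of functions introduced together cluster at the same y-level.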

Figure 8: Scatterplot of the coverage log for Z3 test 15 with breakpoints refreshed every 0.5 seconds.

To evaluate the time and space costs of periodic logging, we selected the 9 Z3 tests that execute for 30 seconds or more in the base configuration. In our first experiment, we refresh all 439,927 breakpoints every half second (configuration per0.5) and measure the overhead as well as the number of breakpoint firings. Since the time to set all breakpoints is around 0.1 seconds, with two refreshes per second the expected overheads range from 6 seconds (for a 30-second test) to 30 seconds (for the longest test, at around 150 seconds).

Table 2 shows the 9 tests ordered by increasing base execution time. As expected, per0.5 increases execution overhead by 4 seconds on the low end (test 21) compared to base, and by 20 seconds on the high end (test 29). While refreshing breakpoints twice a second significantly increases the overhead compared to the single-shot logging of the lcc configuration, it is still less expensive than the bbcover tool (which does not log). Not surprisingly, the size of the periodic log (column "per0.5 blocks") is substantially larger than that of one-shot logging ("lcc blocks"), with growth factors ranging from 2.97× to 10.55×.

In the second experiment, we refresh the breakpoints every 5 seconds (configuration per5.0), resulting in runtimes closer to those of lcc than of per0.5, and reducing the growth rate of the periodic log substantially.

7 Related Work

Once debuggers gave programmers the ability to set and remove breakpoints on instructions [19], the idea of using a one-shot breakpoint to prove that an instruction was executed (or covered) by a test was born. The rest is just a "simple matter of coding". The first tool we found that uses one-shot breakpoints to collect code coverage is the Coverage Assistant in the IBM Application Testing Collection [4], which mentions "minimal" overheads but does not provide implementation specifics.

DataCollider [18] uses hardware breakpoints to sample memory accesses and detect data races; it therefore uses only a small number of breakpoints at a time (e.g., 4 on x86 processors). Conversely, bias-free sampling requires a large number of breakpoints, linear in the size of the program, to be handled efficiently, which LCB does.

Residual test coverage [25] places coverage probes for deployed software only on statements that have not been covered by pre-deployment testing, but these probes are not removed during execution once they have fired.

Tikir et al.'s work on efficient code coverage [29] uses the dynamic instrumentation capabilities of the DynInst system [7] to add and remove code coverage probes at run-time. While efficient, such approaches suffer from the problem that basic blocks smaller than the jump instruction (5 bytes on x86) cannot be monitored without sophisticated fixup of code that branches to the code following the basic block. In addition, special care has to be taken to safely perform the dynamic rewriting of the branch instruction in the presence of concurrent threads. For instance, DynInst temporarily blocks all the threads in the program before removing the instrumentation, to ensure that every thread sees the instruction either before or after the instrumentation.

The Pin framework [22] provides a virtual machine and a trace-based JIT to dynamically rewrite code as it executes, with a code cache to avoid rewriting the same code multiple times. The overhead of Pin without any probes added is around 60% for integer benchmarks. The code cache already provides a form of code coverage, as the presence of code in the cache means it has been executed.

Static instrumentation tools like PureCoverage [3], BullseyeCoverage [1], and Gcov [13] statically modify program source code to insert instrumentation that will

be present in the program throughout its lifetime. These tools can also be used to determine infrequently executed code, albeit at the expense of always triggering the instrumentation for frequently executed code.

THeME [30] is a coverage measurement tool that leverages hardware monitors and static analysis to measure coverage. THeME's average overhead is 5% (with a maximum overhead of up to 30%); however, it can determine only up to 90% of the actual coverage. LCB has a similar average overhead to THeME, but it determines the actual coverage with full accuracy. Furthermore, LCB can be used to obtain multi-shot periodic logs.

Symbolic execution [20, 8] can be used to achieve high coverage in the face of cold code paths. In particular, symbolic execution can explore program paths that remain unexplored after regular testing, to increase coverage. However, symbolic execution is typically costly, and it is therefore better suited as an in-house testing method. Developers can employ symbolic execution in conjunction with BfS; the latter can be used in the field thanks to its low overhead.

8 Conclusion

Bias-free sampling of basic blocks provides a low-overhead way to quickly identify and trace cold code at runtime. Its efficient implementation via breakpoints has numerous advantages over instrumentation-based approaches to monitoring. We demonstrated the application of bias-free sampling to code coverage and its extension to periodic logging, with reasonable overheads and little in the way of optimization.

Acknowledgments

We would like to thank our anonymous reviewers, and Wolfram Schulte, Chandra Prasad, Danny van Velzen, Edouard Bugnion, Jonas Wagner, and Silviu Andrica for their insightful feedback and generous help in building LCB and improving this paper. Baris Kasikci was supported in part by ERC Starting Grant No. 278656.

References

[1] BullseyeCoverage. http://www.bullseye.com/productInfo.html.
[2] http://msdn.microsoft.com/en-us/library/8bs2ecf4%28v=vs.110%29.aspx.
[3] IBM Rational PureCoverage. ftp://ftp.software.ibm.com/software/rational/docs/v2003/purecov.
[4] Application Testing Collection for MVS/ESA and OS/390 User's Guide. http://publibfp.dhe.ibm.com/cgi-bin/bookmgr/Shelves/atgsh001, January 1997.
[5] T. Ball. The concept of dynamic analysis. In FSE, 1999.
[6] S. Bhansali, W.-K. Chen, S. de Jong, A. Edwards, R. Murray, M. Drinić, D. Mihočka, and J. Chau. Framework for instruction-level tracing and analysis of program executions. In VEE, 2006.
[7] B. Buck and J. K. Hollingsworth. An API for runtime code patching. HPCA, 14, 2000.
[8] V. Chipounov, V. Kuznetsov, and G. Candea. S2E: A platform for in-vivo multi-path analysis of software systems. In ASPLOS, 2011.
[9] F. Cristian. Exception handling. In Dependability of Resilient Computers, pages 68–97, 1989.
[10] L. M. de Moura and N. Bjørner. Z3: An efficient SMT solver. In TACAS, 2008.
[11] C. Delimitrou and C. Kozyrakis. Paragon: QoS-aware scheduling for heterogeneous datacenters. In ASPLOS, 2013.
[12] G. W. Dunlap, S. T. King, S. Cinar, M. Basrai, and P. M. Chen. ReVirt: Enabling intrusion analysis through virtual-machine logging and replay. In OSDI, 2002.
[13] GCC coverage testing tool, 2010. http://gcc.gnu.org/onlinedocs/gcc/Gcov.html.
[14] P. Godefroid, M. Y. Levin, and D. A. Molnar. SAGE: Whitebox fuzzing for security testing. Commun. ACM, 55(3), 2012.
[15] Z. Guo, X. Wang, J. Tang, X. Liu, Z. Xu, M. Wu, M. F. Kaashoek, and Z. Zhang. R2: An application-level kernel for record and replay. In OSDI, 2008.
[16] M. Hauswirth and T. Chilimbi. Low-overhead memory leak detection using adaptive statistical profiling. In ASPLOS, 2004.
[17] M. Hirzel and T. Chilimbi. Bursty tracing: A framework for low-overhead temporal profiling. In FDDO: Feedback-Directed and Dynamic Optimization, December 2001.
[18] J. Erickson, M. Musuvathi, S. Burckhardt, and K. Olynyk. Effective data-race detection for the kernel. In OSDI, 2010.
[19] W. H. Josephs. An on-line machine language debugger for OS/360. In Proceedings of the Fall Joint Computer Conference (AFIPS '69), 1969.
[20] J. C. King. Symbolic execution and program testing. Comm. ACM, 1976.
[21] B. Liblit, A. Aiken, A. X. Zheng, and M. I. Jordan. Bug isolation via remote program sampling. In PLDI, 2003.
[22] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood. Pin: Building customized program analysis tools with dynamic instrumentation. In PLDI, 2005.
[23] P. D. Marinescu and G. Candea. LFI: A practical and general library-level fault injector. In DSN, 2009.
[24] M. Arnold and B. G. Ryder. A framework for reducing the cost of instrumented code. In PLDI, 2001.
[25] C. Pavlopoulou and M. Young. Residual test coverage monitoring. In ICSE, 1999.
[26] M. Schwarzkopf, A. Konwinski, M. Abd-El-Malek, and J. Wilkes. Omega: Flexible, scalable schedulers for large compute clusters. In EuroSys, 2013.
[27] A. Srivastava, A. Edwards, and H. Vo. Vulcan: Binary transformation in a distributed environment. Technical Report MSR-TR-2001-50, Microsoft Research, 2001.
[28] T. Ball and J. R. Larus. Optimally profiling and tracing programs. ACM Trans. Program. Lang. Syst., 16(4), July 1994.
[29] M. M. Tikir and J. K. Hollingsworth. Efficient instrumentation for code coverage testing. In ISSTA, 2002.
[30] K. Walcott-Justice, J. Mars, and M. L. Soffa. THeME: A system for testing by hardware monitoring events. In ISSTA, 2012.
