Towards Practical Default-On Multi-Core Record/Replay

Ali José Mashtizadeh, Tal Garfinkel, David Terei, David Mazières, and Mendel Rosenblum

Stanford University
http://www.scs.stanford.edu/~dm/

Abstract

We present Castor, a record/replay system for multi-core applications that provides consistently low and predictable overheads. With Castor, developers can leave record and replay on by default, making it practical to record and reproduce production bugs, or employ fault tolerance to recover from hardware failures.

Castor is inspired by several observations: First, an efficient mechanism for logging non-deterministic events is critical for recording demanding workloads with low overhead. Through careful use of hardware we were able to increase log throughput by 10× or more, e.g., we could record a server handling 10× more requests per second for the same record overhead. Second, most applications can be recorded without modifying source code by using the compiler to instrument language-level sources of non-determinism, in conjunction with more familiar techniques like shared library interposition. Third, while Castor cannot deterministically replay all data races, this limitation is generally unimportant in practice, contrary to what prior work has assumed.

Castor currently supports applications written in C, C++, and Go on FreeBSD. We have evaluated Castor on parallel and server workloads, including a commercial implementation of memcached in Go, which runs Castor in production.

CCS Concepts • Software and its engineering → Operating systems

Keywords Multi-Core Replay, Replay Debugging, Fault-Tolerance

1. Introduction

Record/replay is used in a wide range of open source and commercial applications: replay debugging [1–3, 36] can record difficult bugs when they occur and reproduce them offline; decoupled analysis [2] can record execution online with low overhead, then replay it offline with high-overhead dynamic analysis tools; and hardware fault tolerance [14, 41] can record execution and replay it on a separate machine to enable real-time failover when hardware failures occur.

Unfortunately, the performance overhead of record/replay systems is frequently prohibitive, particularly in multi-core settings. Often this limits their use to test and development environments and makes them impractical for implementing fault tolerance [5].

Our goal with Castor is to provide record/replay with low enough overheads that developers are willing to leave recording on by default [17, 45], making it practical to record and reproduce difficult production bugs, or to enable fault tolerance in high-performance network services.

The widespread use of multi-core architectures poses a key challenge to achieving this goal. Record/replay systems work by recording and replaying non-deterministic program inputs. Multi-core architectures introduce a new source of non-determinism in the form of shared memory interactions, most frequently for synchronization between threads running on separate cores.

Open source [36], commercial [1, 2], and some research systems [20] cope with this challenge by eliminating multi-core execution, and its resultant non-determinism, by deterministically scheduling all threads on a single core.

Other research systems constrain and record non-determinism by statically or dynamically adding and recording synchronization [19, 26, 30], or use speculative execution [8, 29, 37, 49] to find a deterministic replay. These approaches have significant runtime overhead and scalability issues that limit their usefulness for production workloads.

Another approach avoids these overheads. When source code is available, explicit synchronization—e.g., locks, language-level atomics, or (in legacy code) volatile variables—can be directly instrumented to capture and record inter-thread non-determinism [18, 22, 23, 39]. Castor builds on this approach with several new observations:

Hardware-optimized logging is essential for keeping record/replay overheads low as core counts and workload intensities increase. Non-deterministic events, such as system calls and synchronization, are recorded to and replayed from a log. Maximizing log throughput is critical, since record overheads increase in proportion to the log throughput consumed, modulo cache effects. Simply put, the more resources we spend on logging, e.g., CPU and memory bandwidth, the less the application has to accomplish real work.

Castor maximizes log throughput by eliminating contention. For example, hardware-synchronized time stamp counters are used as a contention-free source of ordering for events. This and other optimizations improve log throughput by a factor of 3×–10× or more—e.g., if recording a web server with a given amount of load imposes a 2% overhead without these optimizations, with them we can record 3×–10× more load (requests per second) for the same 2%.
Minimizing log latency is also critical, as it directly impacts application response times. For example, if a thread is descheduled during a performance-critical logging operation, it can introduce delay, as other application threads that depend on the pending operation are prevented from making forward progress. This can result in user-visible performance spikes. To prevent this, we make use of a unique property of hardware transactional memory, which forces transactions to abort on protection ring crossings. This ensures that performance-critical operations will abort if they are descheduled, thus preventing delay. We discuss logging further in §3.

Record/replay can often be enabled transparently, i.e., without any program modification. Castor and similar systems record sources of non-determinism, such as synchronization, by instrumenting source code. Other systems do this manually, e.g., developers replace user-level locks with wrapper functions. Manual instrumentation can be tedious and error prone, often does not support all sources of non-determinism, and can introduce additional and unpredictable performance overheads.

To avoid this, Castor uses a custom LLVM compiler pass to automatically record user-level non-determinism, including atomics, ad hoc synchronization with volatiles, compiler intrinsics, and inline assembly; generally this requires no program modifications (see §3.1). It also makes use of transactional memory to transparently record inline assembly and efficiently record user-level synchronization (see §3.4).

The impact of data races on these systems is often negligible. Shared memory accesses that are not explicitly synchronized, i.e., data races, can introduce non-determinism on replay in Castor and similar systems. In the past, there has been concern that this could undermine the usefulness of replay [29, 30] by interfering with fault tolerance or bug reproducibility. However, this limitation is of negligible importance today for several reasons.

First, modern languages no longer support "benign data races," i.e., races intentionally included for performance. Any benefits they offered in the past are now provided by relaxed-memory-order atomics [11]. Thus, all data races are undefined behavior, i.e., bugs on the same order as dereferencing a NULL pointer [12, 47].

Next, data races appear exceedingly rare relative to other bugs, and thus are unlikely to interfere with reproducing other bugs on replay. For example, over a four-year period, there were 165 data races found in the Chromium code base, 131 of which were found by a race detector, compared to 65,861 total bugs, or about 1 in 400. Further, none of these bugs made it into release builds, so at least in this example, data races would likely have no impact on the production use of replay. Other studies of data races suggest that most do not introduce sufficient non-determinism to impact replay [25, 34], and when they do, there are often viable ways of coping. We explore this further in §6.

Castor works with normal debugging, analysis, and profiling tools such as gdb, lldb, valgrind, and pmcstat. It can also replay modified binaries, enabling the use of ThreadSanitizer and similar tools by recompiling and relinking before replay. Special asserts and printfs can also be added after the fact. We discuss using Castor, along with its performance on server and parallel workloads, in §5.

2. Overview

Castor records program execution by writing the outcome of non-deterministic events to a log so they can be reproduced at replay time, as depicted in Figure 1.

Two types of non-determinism are recorded: input non-determinism and ordering non-determinism. Input non-determinism occurs when an input changes from one execution to another, such as network data from a recv system call, or the random number returned by a RDRAND instruction; in these cases, we record the input to the log. Ordering non-determinism occurs when the order in which operations happen changes the outcome of an execution—for example, the order in which two threads read data from the same socket or acquire a lock. In these cases, we also record a counter that captures the correct ordering (see §3.3).

Castor interposes on non-deterministic events at multiple levels. It relies on a compiler pass to instrument language-level sources of non-determinism such as atomics, compiler intrinsics, and inline assembly (see §3.1). The Castor runtime, which is linked into applications, employs library interposition to instrument OS-level events, including system calls and locks (see §2.1).
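For concreteness, a log entry covering these two kinds of events might look like the sketch below; the layout and field names are illustrative, not Castor's actual definitions.

    #include <stdint.h>

    /* Input events (e.g., recv, RDRAND) record the value itself;
     * ordering events (e.g., lock acquires) record only a time stamp
     * that fixes the operation's position in the replay order (§3.3).
     * Sized to one 64 B cache line, matching the entries in §4.3. */
    typedef struct {
        uint32_t type;       /* e.g., EVENT_RECV or EVENT_LOCK_ACQUIRE */
        uint32_t thread_id;  /* which per-thread queue this belongs to */
        uint64_t timestamp;  /* ordering non-determinism */
        uint64_t object_id;  /* e.g., file descriptor or lock address */
        uint8_t  data[40];   /* input non-determinism: recorded bytes */
    } log_entry_sketch;      /* 4+4+8+8+40 = 64 bytes */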
Figure 1: Recording Execution—Using a combination of shared library interposition and compiler instrumentation, Castor records non-deterministic events, logging them to per-thread ring buffers in shared memory. A separate recording agent process drains the events and sorts them into a partially ordered log that it then writes to disk or sends over the network. (Diagram: the application, linked against libc and the Castor runtime, writes to ring buffers in shared memory, which the recording agent drains; all run atop the operating system.)

Appropriately compiled applications are run with the record and replay command-line tools, which replace key library functions and set environment variables to tell the logging code whether to record or replay.

The record command executes the application in record mode and spawns the recording agent. Each application thread writes log entries to its own per-thread ring buffer in shared memory (see §3.2). The recording agent drains the per-thread buffers and sorts the log entries into a single partially ordered log of events (see §3.3). Periodically the agent writes the log either to disk or over the network, e.g., for fault tolerance (see §3.9).

The replay command executes the program in replay mode and spawns the replay agent. The agent reads log entries from the disk or network and places them into per-thread buffers, where they are consumed by interposition code in the application threads.

2.1 Logged Events

Castor interposes on events at different layers of the stack; which layer is chosen affects performance, usability, and implementation complexity (see §4.2).

Some system calls are recorded directly, like read and clock_gettime; for other events, higher-level library calls are recorded. For example, Castor records pthread mutex operations directly, rather than recording the multiple _umtx_op system calls these operations make internally; this results in smaller, easier-to-understand logs and lower overhead.

Castor supports two modes of operation: Self-contained mode logs the result of most system calls, including those that interact with the file system such as open and read; replay then uses the previous results rather than re-executing the system calls. Recording file system calls eliminates external dependencies, i.e., no files other than the log are needed for replay. A self-contained log is easy to use and transport, making this mode ideal for replay debugging.

Passthrough mode, by contrast, re-executes many system calls at replay time to recreate kernel state so replay can "go live," to support fault tolerance. In this mode, the initial file system state is captured using ZFS snapshots (see §4.1).

We record language-level events in the compiler, including compiler builtins, inline assembly, and shared memory operations (see §3.1). Recording shared memory non-determinism in C11/C++11 and Go is possible because the memory consistency model explicitly defines constructs such as atomics and locks for inter-thread synchronization [12, 27, 47]. Conflicting memory accesses that do not synchronize with one another, i.e., data races, are undefined behavior that we do not record. We discuss the impact of data races in §6.
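At the libc layer, recording in self-contained mode amounts to a wrapper of the following shape; all castor_* names here are hypothetical stand-ins for the runtime's internals, and FreeBSD's weak libc symbols (§4.1) are what let such a replacement link in place of the original.

    #include <sys/syscall.h>
    #include <unistd.h>

    enum { CASTOR_RECORD = 1, CASTOR_REPLAY = 2 };
    extern int castor_mode;
    extern void castor_log_read(int fd, const void *buf, ssize_t ret);
    extern ssize_t castor_replay_read(int fd, void *buf, size_t nbytes);

    /* Record: execute the real system call, then log its return value
     * and the bytes read. Replay: return the logged result without
     * re-executing the call, so no files besides the log are needed. */
    ssize_t read(int fd, void *buf, size_t nbytes) {
        if (castor_mode == CASTOR_RECORD) {
            ssize_t ret = syscall(SYS_read, fd, buf, nbytes);
            castor_log_read(fd, buf, ret);
            return ret;
        }
        return castor_replay_read(fd, buf, nbytes);
    }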
2.2 Optimizing for Default-On Use

To support default-on operation, we optimize for log throughput, tail latency, and end-to-end latency. Our goal is to keep performance consistent regardless of the number of threads, locks, or cores in use.

Log throughput is a critical metric, as logging overhead is directly proportional to the amount of log throughput consumed, modulo cache effects. For example, in our benchmarks a 10-core Nginx setup could handle 160K requests per second. If each request generates 10 events, that requires 1.6M events per second of log throughput. Castor can process 30M events per second if it owns a whole core, and 1.6M / 30M is about 5%. This translates to 5% application overhead, since that is what Castor steals from the application for logging. Doubling throughput to 60M would cut that in half, either halving application overhead or allowing Castor to cope with twice as much load for the same overhead. Castor employs a variety of techniques to maximize log throughput by eliminating contention (see §3.2 and §3.3).

Tail latency comes from unpredictable performance spikes that affect the long tail of response times. High-performance servers like memcached want to avoid these spikes, as they can negatively impact user experience [17]. Recording user-level synchronization carelessly can introduce performance spikes when threads are descheduled during critical logging operations. Castor mitigates this by using transactional memory to keep tail latency low (see §3.4).

End-to-end latency is how long it takes from the time we record an event until we can replay it—keeping this low and predictable is critical to support fault tolerance and similar applications that replay on a replica in parallel [16, 23]. Castor uses incremental log sorting and other techniques (see §3.4 and §3.9) to keep end-to-end latency low. This ensures that the primary is responsive, and that execution on the replica does not lag too far behind the primary, which can impact downtime on failover.
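The overhead model is simple enough to state in code; this helper merely restates the arithmetic above and is only illustrative.

    /* Fraction of machine resources spent on logging: the event rate
     * the workload generates divided by the rate the agent sustains.
     * Example from the text: 160K req/s x 10 events/req = 1.6M events/s;
     * 1.6M / 30M = ~0.053, i.e., roughly 5% stolen from the application. */
    static double log_overhead(double app_events_per_sec,
                               double agent_events_per_sec) {
        return app_events_per_sec / agent_events_per_sec;
    }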
3. Design

Our discussion begins with how to transparently record language-level non-determinism and how to optimize logging for high throughput and low latency. Later we discuss how Castor supports replay and fault tolerance.

3.1 Transparently Recording Language-Level Events

Past systems required developers to manually modify applications to record user-level synchronization [18, 22, 23, 39]. This can be tedious and error prone, and requires developers to deeply understand the inner workings of replay.

In contrast, Castor uses the compiler to transparently record language-level non-determinism. This requires no user input except in a few rare cases discussed below.

We implemented this by leveraging Clang/LLVM, the standard compiler on FreeBSD and OS X. Clang/LLVM is extensible, allowing us to dynamically load our compiler pass without modifying the compiler.

Our pass records C11/C++11 atomics, compiler intrinsics (e.g., __sync_* builtins), and builtins for non-deterministic instructions, e.g., RDRAND and RDTSC. We also record/replay operations on volatile variables. While ad hoc synchronization is officially unsupported in C/C++, our conservative interpretation of the volatile qualifier allows us to precisely replay legacy code with "benign races."

Inline assembly is replayed as a black box by wrapping it in a transaction. Unfortunately, a few instructions such as CPUID and PAUSE are unsupported inside of transactions, forcing us to issue a warning and fall back to locking until the code is fixed. Fortunately, this is not common.

Transactions are recorded in a similar manner, using nested transactions on TSX. On replay, the runtime ensures transactions either execute or are aborted deterministically. The runtime retries spurious aborts, which occur for many reasons, including interrupts.
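As an illustration of the pass's effect, an atomic operation in the source is effectively rewritten into a recorded form; castor_log_atomic_op is a hypothetical runtime hook, and the real transformation happens at the LLVM IR level rather than in C.

    #include <stdatomic.h>
    #include <stdint.h>

    extern void castor_log_atomic_op(void *obj, uint64_t result);

    /* What the programmer wrote: */
    int bump(atomic_int *ctr) {
        return atomic_fetch_add(ctr, 1);
    }

    /* What the instrumented program effectively executes: the operation
     * and its ordering record are captured together (atomically, via a
     * transaction paired with a time stamp; see §3.3 and §3.4). */
    int bump_recorded(atomic_int *ctr) {
        int old = atomic_fetch_add(ctr, 1);
        castor_log_atomic_op(ctr, (uint64_t)old);
        return old;
    }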
3.2 Efficient Log Structure

Log structure significantly affects log throughput. Many previous record/replay systems log all events to a single FIFO queue, which scales poorly. As core counts increase, contention between threads grows and log throughput plummets, as shown by the curve labeled Lock in Figure 2b. Overhead is primarily due to write/write conflicts on the queue's tail pointer. This structure can also induce head-of-line blocking.

Using per-thread FIFO queues with an atomically incremented global counter increases log throughput by 3×, as shown by the curves labeled Atomic in Figures 2a and 2b.

Castor can still experience contention on adding and removing entries from the FIFO queues because the reader and writer are on separate cores. Cache coherence traffic dominates log append times as load increases, mostly due to contention on the global counter and also from cache lines bouncing between the logging and recording threads. We reduce this using familiar techniques, e.g., eliminating false sharing by placing the head and tail pointers of our queue, and log entries, in separate cache lines.

3.3 Scalable Ordering with Hardware Time Stamps

Many operations must replay in order to ensure consistency. Examples include any operation that modifies shared state, such as an open() system call, which modifies the descriptor table, or an atomic compare-and-exchange. If we are not careful, the mechanism used to provide ordering can become a significant source of overhead.

One common approach is to impose a total order on events using a global counter (shown as curve Atomic in Figures 2a and 2b). Unfortunately, this becomes a central point of contention between threads.

To avoid this, we leverage the synchronized global time stamp counter (TSC) present in modern processors to provide a partial order, approximating Lamport time stamps [27].¹ Pairing every logged operation with a time stamp of when it completed (see Listing 1) eliminates contention for the counter between cores. Unlike the global counter approach, there is a small chance two cores will read the same time stamp; hence Castor must read the time stamp atomically with the operation in question.

Using the TSC approach, log throughput nearly doubles compared to the global counter approach (Atomic) on our smaller machine, as shown in Figure 2a. On our larger machine, the log throughput for TSC can exceed Atomic by 10× or more, as shown in Figure 2b. Here, the cost of write-write conflicts becomes more pronounced due to higher cache latencies.

Our TSC approach also eliminates contention between agents. Thus, when the record agent becomes a bottleneck, we can add another one. By running two agents, one on each socket, our TSC approach can log 50 million messages per second with 22 cores, shown in Figure 2b as TSC2X.

¹ Synchronized time stamp counters have been available on processors since the Nehalem microarchitecture (2008). Older CPUs and other architectures have OS-synchronized per-CPU counters that can be synchronized to within a few cycles of one another. Having the delta between counters bounded low enough allows us to wait inside each event enqueue to ensure we know the relative ordering of events.

Figure 2: Log throughput with different designs, measured on (a) a 4-core/8-thread Skylake i7-6700 and (b) a dual-socket machine with a 12-core E5-2658 v3 per socket (y-axis: millions of messages per second; x-axis: number of threads). Lock is a single shared FIFO. Atomic is per-thread FIFOs with a global atomic counter, removing contention between threads. TSC replaces the counter with synchronized time stamps, for less contention between cores. TSC2X has a per-socket agent that doubles the log throughput. The only scalability problems for the TSC approach come from read/write conflicts between the application thread and agent, while the counter approach has write/write conflicts for atomic increments.
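The queue layout this section arrives at can be summarized in a few lines of C; this is a sketch, and Castor's actual definitions differ in detail.

    #include <stdalign.h>
    #include <stdatomic.h>
    #include <stdint.h>

    #define CACHELINE 64

    typedef struct {
        alignas(CACHELINE) uint8_t bytes[CACHELINE]; /* one entry per line */
    } log_entry;

    /* Per-thread FIFO: the producer (application thread) and consumer
     * (recording agent) indices live on separate cache lines to avoid
     * false sharing; cross-thread order comes from time stamps, not
     * from any shared counter. */
    typedef struct {
        alignas(CACHELINE) atomic_uint_fast64_t head; /* consumer */
        alignas(CACHELINE) atomic_uint_fast64_t tail; /* producer */
        log_entry entries[64];                        /* e.g., 4 KiB worth */
    } per_thread_log;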

3.4 Transactional Memory for Bounding Latency

Taking an OS context switch in the critical section of a logging operation can induce high overheads, either because the operation that was interrupted now takes much longer, or because all the other threads waiting on the lock associated with that critical section are forced to wait.

Transactional memory allows us to avoid this situation, since transactions cannot span a protection ring crossing. Running critical sections in a transaction ensures they will abort if preempted. We exploit this in two different scenarios.

Predictably Low Overhead Atomics: Developers assume atomics and similar operations will be fast; violating this assumption can introduce strange new performance problems that impact tail latency. Recording atomics without violating this assumption can be challenging. To see why, remember that recording atomics requires a source of ordering, and that reading the order and performing the operation must be atomic for consistency, as discussed in §3.3.

Next, consider what happens if we use per-object locking to provide atomicity for reading the order and doing an atomic increment. If the OS deschedules the thread performing the increment, all other contending threads must wait. Thus, an operation that the developer expected to take a hundred cycles or less can now take millions of cycles.

We have seen this manifest in workloads as visible performance spikes when using per-object locks. When microbenchmarking this, we saw periodic spikes in latency as high as 500 million cycles, or approximately 0.22 seconds, on our Skylake machine.

To prevent these spikes, we use transactional memory to atomically read the time stamp counter and perform the operation. Thus, if the OS deschedules the thread, the transaction will abort on the protection ring crossing. Of course, using transactions also helps by eliminating contention for per-object locks. Unavoidably, we fall back to per-object locks for processors that do not support transactional memory.²

Incremental Log Sorting for Low End-to-End Latency: To support real-time replay for fault tolerance, we need to minimize the delay between when we record a log entry and when we replay it. When using fault tolerance (see §3.9), this delay affects the response time on the primary and how far our replica lags behind the primary, which affects downtime when a failover occurs.

Before we can replay the log, we need to sort it. To minimize delay, we sort log entries incrementally using a sliding window. Once entries are in sorted order, they are sent to the replica and the window advances. To bound the number of entries in the window to a predictable function of the logging rate, we need to bound the time between when events are appended to the log and when their time stamp was computed.

Unfortunately, if the OS deschedules a thread before it can append an event to the log—i.e., after the operation has been performed and the time stamp has been read—the append can be delayed indefinitely. To prevent this, we put the entire log append within a transaction, again forcing a complete abort and retry if the thread is interrupted.

To see why this matters in the context of fault tolerance, notice that before our primary can externalize a pending output, e.g., send a packet, we must sort all entries prior to the output and send them to the replica, allowing it to take over if the primary fails. If an append is delayed, we must stall the pending output and grow our window size, inducing user-visible latency proportional to the size of the window.

To see how this could impact latency, we measured how much delay is induced if our thread is descheduled, using either the atomic counter+lock or the TSC+TSX approach. In the common case, this was on the order of thousands of cycles; however, worst-case behavior was 300 million cycles, or 134 ms of delay, for the counter approach—greater than the RTT across the United States. Meanwhile, the worst-case behavior when using transactional memory was 500 thousand cycles, or 223 µs of delay.

² Transactional memory was introduced by Intel in 2014, and is available in all Intel processors from laptop- to server-class chips as of 2016. It was available in the POWER8 and SPARC T5 architectures prior to Intel's release.
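The fast path can also be sketched with TSX intrinsics; this simplified version omits the free-entry check and retry policy, and the entry layout and slow-path name are ours (the code Castor actually generates is the assembly in Listing 1, §4.3).

    #include <immintrin.h>   /* _xbegin/_xend; compile with -mrtm */
    #include <x86intrin.h>   /* __rdtsc */
    #include <stdatomic.h>
    #include <stdint.h>

    typedef struct { uint32_t type; uint64_t tsc; uint64_t val; } entry_t;
    extern int record_xchg_slowpath(atomic_int *obj, int newval, entry_t *e);

    /* The exchange and its time stamp commit together or not at all.
     * If the OS deschedules this thread, the protection ring crossing
     * aborts the transaction, so a half-logged operation can never
     * stall other threads or delay the log append. */
    int record_xchg(atomic_int *obj, int newval, entry_t *e) {
        if (_xbegin() == _XBEGIN_STARTED) {
            int old = atomic_exchange(obj, newval);
            e->type = 1;            /* ENTRY_ATOMIC_XCHG */
            e->tsc  = __rdtsc();    /* ordering time stamp */
            e->val  = (uint64_t)old;
            _xend();
            return old;
        }
        return record_xchg_slowpath(obj, newval, e); /* sharded lock */
    }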
Finally, while our TSX approach works for all user-level operations, we cannot use it for recording OS-level operations, since they cross protection boundaries. Future kernel support could allow us to mitigate this.

3.5 Many Threads with Low Overhead

As the number of running threads increases, the overhead of polling logs for pending entries to drain by the record agent can become significant.

This overhead was particularly pronounced in our Go implementation (see §4.2). In Go, the use of hundreds or even thousands of user-level threads, called goroutines, is common. As our logs are per-goroutine, this quickly became a problem. At even 10 goroutines the overhead of polling logs became noticeable. At 100 goroutines throughput could drop from 9M to just 4.2M events per second. At 1,000 goroutines throughput dropped to 0.8M events per second, less than one-tenth of peak throughput.

To address this, we added hooks to the Go scheduler so the agent can track which goroutines are running and avoid scanning logs with no pending entries. When recording, the agent drains a goroutine's log when it is descheduled. On replay, the current goroutine will yield if it is waiting too long for the next event to occur, since another goroutine may need to consume earlier log entries to make forward progress. Using this approach, Castor can handle up to 10K goroutines with a nearly constant 9M events per second of throughput.

3.6 Record Buffer Sizing

Castor provides record modes optimized for different use cases:

Fault tolerance: Requires low latency. We use short ring buffers (4 KiB per thread) and dedicate a core to the recording agent, which constantly polls. This can also improve throughput, as all data can fit in the L1/L2 cache. Log sorting is done online.

Debugging and analysis: We use 4 MiB/thread buffers and allow the recording agent to sleep when there are few events. Thus, it uses only a fraction of a CPU core. Log sorting is done offline.

Crash recording: We use 128 MiB/thread for recording crashes in long-running applications, similar to commercial tools [1, 2]. In this case there is no agent and no draining of logs. When a log fills up, Castor dumps a checkpoint to disk, flushes the log to disk asynchronously, and allocates a fresh log for the next interval. This supports recording a finite interval prior to a crash. Older logs and checkpoints can be discarded to bound storage consumption.

Buffer sizing can impact Castor's log throughput and ability to handle workload spikes. Larger per-thread buffers handle workload spikes better while increasing cache pressure. Smaller buffers tend to perform better, as all data fits in the cache, reducing Castor's cache footprint, but increase sensitivity to workload spikes and to descheduling/interrupts of the agent process. Adaptively tuning this based on workload could be useful.
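The three modes reduce to a buffer-size and agent-behavior configuration; the constants below simply restate the numbers above, with names of our choosing.

    /* Per-thread ring buffer sizes for the three record modes. */
    #define BUF_FAULT_TOLERANCE (4u << 10)    /* 4 KiB: polling agent, online sort */
    #define BUF_DEBUG_ANALYSIS  (4u << 20)    /* 4 MiB: sleeping agent, offline sort */
    #define BUF_CRASH_RECORDING (128u << 20)  /* 128 MiB: no agent, checkpoint+flush */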
3.7 Replay Modes

As discussed in §3.3, certain events must replay in order to ensure consistency. Castor supports two approaches, each with its own trade-offs. In both modes, the replay agent reads log entries from either the disk or network and places them into per-thread queues.

Totally ordered replay replays events in the same total order they were recorded in. To start, Castor takes log entries, which are already sorted by event id (see §3.4), and replaces the event id with a monotonic count. Next, Castor sets up a global monotonic counter to control replay order; an event can only replay when the global counter matches its event id.

On replay, when an application thread reaches an instrumented event, such as a lock acquire, it waits for the agent to enqueue a log entry for that event. Once it has the entry, it waits for the event id of the entry to match the global counter. When this happens, the thread replays the lock acquire, increments the global counter, and execution continues.

This approach has a few benefits: it is simple, performs reasonably at low core counts, and requires almost no processing by the agent. However, while threads run in parallel, they are still forced to synchronize on a global counter. This limits parallelism, since events without dependencies could be running in parallel.

Partially ordered replay removes global synchronization. Instead, the replay agent dynamically infers which events have dependencies and constrains them from executing concurrently.

To do this, the agent tracks dependencies in the log using a bitmap where each bit corresponds to an event's object ID, e.g., the address of a lock. During replay, before the agent enqueues a log entry for an operation on an object, it checks to see if a bit has already been set for the object; if it has, the agent waits until the bit is cleared; if it has not, it sets the bit. Once an application thread has processed an operation on an object, it clears the bit for that object.

Partially ordered replay has two advantages. First, it eliminates the write/write contention on a global counter. Second, it relaxes the execution schedule of operations to maximize parallelism. This can improve replay performance and also reduces sensitivity to scheduler and interrupt delivery differences between record and replay.
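The dependency tracking for partially ordered replay can be sketched as follows; the bitmap sizing, hashing, and waiting policy here are illustrative choices, not Castor's.

    #include <sched.h>
    #include <stdatomic.h>
    #include <stdint.h>

    #define NSLOTS (1u << 20)
    static atomic_uchar busy[NSLOTS / 8];       /* one bit per object slot */

    /* Replay agent: before enqueueing an event on object_id, wait for
     * any earlier, still-outstanding event on the same object, then
     * mark the object busy. */
    void agent_mark_busy(uint64_t object_id) {
        uint32_t s = (uint32_t)(object_id % NSLOTS);
        unsigned char mask = (unsigned char)(1u << (s % 8));
        while (atomic_fetch_or(&busy[s / 8], mask) & mask)
            sched_yield();                      /* bit already set: wait */
    }

    /* Application thread: after replaying the operation, clear the bit
     * so the agent can release the next event on this object. */
    void thread_clear_busy(uint64_t object_id) {
        uint32_t s = (uint32_t)(object_id % NSLOTS);
        atomic_fetch_and(&busy[s / 8], (unsigned char)~(1u << (s % 8)));
    }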
3.8 Divergence Detection

Divergence occurs when replayed execution differs sufficiently from recorded execution that continued replay is either infeasible or inadvisable. Divergence may be due to implementation bugs in record/replay, data races, hardware non-determinism, or replaying on a different microarchitecture with behavioral differences. On replay, Castor compares the log to program behavior to detect three types of divergence: event divergence, deadlock divergence, and argument divergence.

Event divergence is a disagreement between a thread and its own log about what type of event is next. For example, if a thread makes a gettimeofday system call, and Castor goes to the log to fetch the log entry for this event and instead finds an entry for a read system call, something has gone wrong. We saw these frequently while implementing Castor when we failed to record a source of non-determinism that changed control flow on replay.

Deadlock divergence is a disagreement regarding the ordering of the log between threads, i.e., when synchronization that is not being recorded/replayed (races or uninstrumented code) deadlocks with the log-enforced ordering. For example, suppose thread A is blocked waiting for thread B, thread B tries to acquire a lock, and the log shows that thread A should get the lock next. Something went terribly wrong, as replay will deadlock. We encountered these during development when we failed to properly instrument synchronization.

Argument divergence is a disagreement regarding the values associated with an event and the log, e.g., the arguments to a system call. For example, for the write system call we log the number of bytes written and a hash of the value being written. For efficiency, we currently use xxHash [6], though hardware support for SHA-1/256 will be available in future processors [50]. This check is critical for fault tolerance, as the program output on record and replay should be indistinguishable.

All of these checks proved very helpful while developing Castor. The first two checks caught most bugs, including two intentional races in the Go runtime.

3.9 Fault Tolerance

Fault tolerance supports continuous availability in the face of hardware failures, and involves two servers: the primary, which is being recorded, and the replica, which streams the log in real time from the primary and replays it to create a hot standby.

Failover occurs when the hardware running the primary fails and the replica takes over. Failover in Castor is transparent, i.e., the primary and replica are indistinguishable from the perspective of an external observer, since divergence detection (see §3.8) ensures that output from the primary and backup are identical up to the point of failover. This is a potentially subtle point: the replica and primary can differ in their internal state prior to failover, but this is irrelevant. Like two runs of an NFA that share a common prefix, before failover they are identical from an external observer's perspective. After failover nothing changes; from the observer's perspective it is as if only the replica ever existed.

When the replica detects a divergence, it immediately pauses and resynchronizes. To resynchronize, the replica notifies the primary, which sends a checkpoint. The Castor runtime replays all log entries prior to the checkpoint to ensure kernel state is consistent, then loads the checkpoint and resumes execution.

Before Castor allows the primary to externalize an output event, i.e., send a packet, it ensures that the replica has consumed all log entries prior to that event. This guarantees that the replica received the log entries and that a divergence did not occur. Thus, if the primary dies, the replica has everything it needs to take over transparently.

To implement this, when a thread on the primary attempts to send a packet, Castor delays the packet and tags it with a time stamp. As the replica consumes the log, it asynchronously acknowledges the latest time stamp it has consumed. When the primary receives an acknowledgment, it releases all packets with a lower time stamp.
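The output-commit rule can be sketched as a small gate on the primary's transmit path; every name here is hypothetical, and real packet handling is more involved.

    #include <stdint.h>
    #include <stddef.h>

    /* Output delayed on the primary, tagged with the time stamp of the
     * last log entry that must reach the replica before release. */
    struct pending_pkt {
        uint64_t commit_tsc;
        struct pending_pkt *next;   /* list kept sorted by commit_tsc */
        /* ... packet payload ... */
    };

    static struct pending_pkt *pending_head;
    extern void transmit(struct pending_pkt *p);

    /* Called when the replica acknowledges consuming all log entries
     * with time stamps <= acked_tsc: releasing these packets is safe,
     * because the replica can take over at any point up to here. */
    void on_replica_ack(uint64_t acked_tsc) {
        while (pending_head && pending_head->commit_tsc <= acked_tsc) {
            struct pending_pkt *p = pending_head;
            pending_head = p->next;
            transmit(p);
        }
    }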
4. Implementation

The Castor runtime is under 6 KLOC. The record, replay, and fault tolerance agents comprise just over 1 KLOC. The LLVM compiler pass is just over 1 KLOC. We changed 5 lines in libc. Go support requires 2 KLOC, 500 lines of which are shared with the C version.

4.1 FreeBSD

Capturing OS-level non-determinism in libc and libthr (pthread) is done through library interposition. FreeBSD provides weak symbols for much of libc, meaning we can override these symbols at link time by providing our own replacement with the same name, e.g., read. Other functions can be interposed on dynamically by changing a table of function pointers built into libc for this purpose. Finally, we modified 5 lines of code in libc and libthr to expose symbols we needed to hook.

In self-contained mode (see §2.1), to record, Castor executes system calls and writes their return value and side effects to the log. On replay, most system calls are elided, and their values and side effects are returned from the log, similar to a variety of other systems [18, 20, 39, 40, 44]. We interpose on roughly 107 types of events in our current prototype, including system calls, library calls, user-level locks, etc.

The Castor runtime maintains a table of file descriptor metadata, including their type and per-descriptor locks. To keep this consistent, we interpose on any call that can modify the descriptor space, such as open, dup/dup2, and fcntl. Castor uses this table for a variety of purposes. Per-resource locks are used to ensure proper ordering on aliased resources when replaying. Knowing the type of descriptors lets us avoid recording reads and writes on socketpairs or pipes between recorded processes.

For fault tolerance, we replay in passthrough mode, which re-executes most system calls to recreate kernel state on the replica. Replaying system calls makes it possible for the replayed process to go live when the original recording host fails. In passthrough mode, we leverage ZFS file system snapshots to synchronize the initial file system state of the primary and secondary. As a side effect, this improves performance by eliminating the need to log file system reads.
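The descriptor table described above can be pictured as follows; the names are ours, not Castor's.

    #include <pthread.h>

    /* Per-descriptor metadata: the type lets Castor skip logging reads
     * and writes on pipes/socketpairs between recorded processes; the
     * per-resource lock orders operations on aliased descriptors
     * (e.g., after dup) during record and replay. */
    typedef enum { FD_FILE, FD_SOCKET, FD_PIPE, FD_SOCKETPAIR } fd_type;

    struct fd_meta {
        fd_type         type;
        pthread_mutex_t lock;
    };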
4.2 Go Language/Runtime

Supporting Go required integrating with the Go runtime. Languages such as Objective-C and Rust that use LLVM and libc work out of the box with Castor. However, Go developers use their own compiler, linker, system libraries, etc.; thus, additional support was required. Our implementation is built on Go 1.4.2.

To support Go, we initially tried recording the lower layers of the Go runtime with one log per OS thread, similar to our C/C++ implementation. However, after some experience, we switched to a different approach: logging higher-level APIs built on the core runtime and keeping one log per Go thread (called goroutines). This allowed us to simplify our implementation considerably by ignoring lower-level non-determinism in the runtime, such as the scheduler, network stack, allocator, and garbage collector. It also resulted in logs that are smaller and easier to understand, as they correspond more closely to language-level semantics. Supporting large numbers of goroutines required integrating more closely with the Go scheduler, as discussed in §3.5.

Logging at a higher layer exposed some internal non-determinism. In particular, when finalizers execute can vary from run to run, depending on when garbage collection happens. Finalizers are processed by a goroutine associated with the garbage collector. We modified this goroutine to log when each individual finalizer is run relative to other events in the system. On replay, the routine executes the finalizers in the same order they ran during the recording. This sometimes requires the goroutine to force a garbage collection if the object being finalized is not ready.

Asynchronous signal handling is simplified by logging at a higher layer. Instead of delivering signals at the OS level, we can leverage existing machinery in Go. In particular, asynchronous signals, e.g., SIGINT, are delivered by a goroutine in the language runtime that listens on a channel for signal messages sent by lower layers of the runtime. To record signals, we modified this routine to log and replay signals.
4.3 Event Logging

Castor's log is a FIFO queue. Each entry is the size of an L1/L2 cache line (64 B). The head and tail live on their own cache lines. The recording agent is a separate process that polls each per-thread queue and drains them into a buffer that gets flushed to disk or over the network. As discussed in §3.2, Castor uses transactional memory and time stamp counters for contention-free logging. These are available as TSX and RDTSC on x86-64 processors. When transactional memory is not supported or fails, we fall back to user-level sharded (per-object) locks.

Transactional Memory: To log an event, the runtime first checks that the next log entry is free by examining head and tail. It then executes a transaction in which it atomically checks the sharded lock, executes the logged operation, reads the time stamp counter, writes the log entry, and bumps the tail pointer.

Listing 1 shows a simplified version of the code generated for recording an atomic exchange (the XCHG instruction). The time stamp is read (RDTSC) in the same transaction as the atomic exchange, which provides a causal ordering of the log entry relative to other changes on the same atomic variable.

    # %rcx = Pointer to free log entry
    # %rbx = Per-thread log tail
    # %r8  = Pointer to sharded lock
        movq   ENTRYTYPE_ATOMIC_XCHG, (%rcx)
        xbegin record_slowpath
        cmpl   $0, (%r8)        # Read sharded lock
        je     lock_notheld
        xabort $1               # Abort if others are in slow path
    lock_notheld:
        xchg   %r9, (%r10)      # Atomic exchange
        rdtsc
        mov    %eax, 8(%rcx)
        mov    %edx, 12(%rcx)
        incq   (%rbx)
        xend

Listing 1: A simplified version of the code executed for the record path.

Performance: We measured the transaction abort rate and throughput during the development process using a tight logging loop. We found most aborts were due to interrupts. However, use of a time stamp counter is crucial; using a shared in-memory counter would cause most transactions to abort due to conflicts on the shared counter. We noticed a modest benefit from increasing log entries to the L3 cache line size (128 B) vs. the L1/L2 size (64 B), but decided against doing so to avoid the impact on log size. We saw a slight uptick in spurious aborts with the use of hyper-threading, which we attribute to hyper-threads sharing an L1/L2 cache. We speculate that cache lines lack a bit to record whether they belong to one or both hyper-threads' read sets, though in practice such aborts have not been an issue.

Sharded Locks and Nested Events: Sometimes Castor cannot use transactional memory, either because a processor does not support it or because it is incompatible with a particular operation. System calls always trigger aborts, as do certain instructions, including CPUID and PAUSE.

When Castor cannot use transactional memory, the atomicity of time stamps and logged events is enforced through user-level sharded locks. Locks are sharded by a unique resource identifier, such as a memory address or file descriptor.

Recording higher-level interfaces sometimes requires the use of nested events, since we still want to record and replay ordering at lower layers properly. For example, when a mutex is statically initialized with PTHREAD_MUTEX_INITIALIZER, the first call to pthread_mutex_lock dynamically allocates memory. In this case pthread_mutex_lock, which is logged, calls into malloc, which also uses locks. To record something like this, Castor acquires a sharded lock and logs a start event recording when it acquired the outer lock. It then performs the target operation, including generating log entries for nested events. Finally, it generates an end event recording the target's non-deterministic results (if any), before releasing the lock. To support nested events on the same resource, we built user-level recursive mutexes using C11 atomics.

We also created our own versions of assert and printf that bypass logging. We can add these to our programs after the fact to debug at replay time.

To deploy fault tolerance, we provide a special command, cft, that launches our program in passthrough mode, either recording or replaying. When using cft, the record agent transmits the log over the network to the replay agent. We use CARP (Common Address Redundancy Protocol) support in FreeBSD to fail over an IP address from one server to another on hardware failure. For UDP-based applications, this works seamlessly.

5. Evaluation

Our experience using Castor is discussed in §5.1. Performance on heavy server workloads is presented in §5.2, and multi-core scalability via parallel benchmarks from SPLASH-2 and PARSEC is presented in §5.3. We conclude with microbenchmarks and some general observations on performance with Castor.

Our measurements were taken using two machines running FreeBSD 10.3. Our first machine has a 3.4 GHz Intel i7-6700 Skylake processor with 4 cores/8 hyper-threads, 8 GiB of memory, and transactional memory (TSX) support. Our second machine has dual-socket Intel Xeon E5-2658 v3 Haswell processors with 12 cores/24 hyper-threads per socket and 128 GiB of memory, but no TSX support. All measurements were done in debugging and analysis mode (see §3.6), using 4 MiB/thread buffers with a single recording agent.

5.1 Using Castor

We built all workloads evaluated in this paper by recompiling and relinking unmodified source code with Castor, supporting our thesis that a custom compiler pass can avoid the need for manual instrumentation.

We used Castor with unmodified debugging tools, including gdb and lldb for replay debugging, as well as valgrind. Small runtime detours that insert or consume log entries are the only visible difference while debugging. With performance tools like pmcstat, we see the time spent in the record/replay infrastructure and the impact of eliding system calls, as we made no effort to hide these details from the tool.

Castor supports replaying modified binaries, allowing us to perform many tasks retroactively, i.e., by recompiling and relinking before replay to change libraries, add instrumentation, or even change application source code. So long as changes do not impact the log, this works just fine. To use dynamic analysis tools like ThreadSanitizer and AddressSanitizer, we recompile and relink our application with the appropriate flags, then replay as usual. This works because the runtimes for these checkers only make raw system calls, so they do not interact with Castor's event log.

5.2 Network Server Workloads

We evaluated a variety of network server workloads with Castor. As with other workloads in our evaluation, replay times were substantially lower than record times, as we discuss in §5.5.

Nginx: We benchmarked Nginx [35] on our Haswell machine, using the wrk HTTP load generator configured with 12 threads and 100 connections for a duration of 30 seconds. Nginx scaled well with approximately 1% overhead until 10 cores. At 10 cores we were attempting to handle 160K requests per second and approximately 2 million log messages per second. At this point we began to run up against log throughput as a limiting factor. As predicted, overhead jumped to around 9%, roughly what we expect to see at 10 cores, where our peak log throughput is 20M events per second, as shown in Figure 2b.

Memcached: We measured record overhead of memcached on our Skylake machine, using mutilate [31] to generate a maximum load of 119K requests per second. On average we saw 29 events per request. Multithreaded memcached [48] with a single worker thread uses a listener thread and 5 additional helper threads, for 7 threads total on our 4-core/8-hyper-thread machine. Overhead again increased in proportion to log bandwidth, with a roughly 7.5% drop in throughput at 119K requests per second.

Lighttpd: We measured record overhead for lighttpd [4] on our Skylake machine, with load generated by ab [46], which we tuned to generate the maximum requests lighttpd could handle.
Lighttpd is a multiprocess webserver, and as such record/replay has no effect on its scalability. Thus, performance is mainly dependent on the logging rate. Turning on recording incurred a roughly 3% reduction in throughput.

PostgreSQL: We measured record overhead for PostgreSQL [38] on our Skylake machine. Using pgbench with a read/write-intensive workload for 10 minutes with 4 client threads and two worker threads, we measured a roughly 1% drop in throughput.

LevelDB: We measured record overhead for LevelDB [21] on our Skylake machine. Using db_bench with default settings, we saw a 2.6% slowdown from recording, while CPU consumption increased 9%.

Go (HTTP): We evaluated Caddy [33], a popular HTTP server written in Go, on our Haswell machine. Our Go implementation is currently unoptimized and is impacted by limitations in Go's C compiler. Record overhead for Caddy was 15% or less, scaling from one to twelve cores. At one core, we saw 5,255 req/s when recording vs. 6,175 req/s normally, a 15% overhead. At twelve cores, we saw 36,658 req/s when recording vs. 39,156 req/s normally, a 6.6% overhead. Replay was 1.9× faster than record.

Go Memcache: We evaluated a commercial memcached implementation written in Go. Record overhead was 5.7% or lower from one to twelve cores. This memcached implementation did not scale well due to its use of coarse-grained locking. At one core we saw 72,708 req/s when recording vs. 77,109 req/s normally, a 5.7% overhead. At twelve cores, we saw 112,270 req/s when recording vs. 117,213 req/s normally, a 4.2% overhead. Replay was 1.3× faster than record. Running our implementation in a production setting serving live traffic, we observed an additional 5% of CPU required to run record, and approximately 220 MB of uncompressed log was generated per hour.

5.3 Multi-Core Scalability

We evaluated Castor with benchmarks from the PARSEC and SPLASH-2 multiprocessing benchmark suites [9, 51] on our Haswell machine. PARSEC/SPLASH-2 are commonly used to evaluate multi-core record/replay systems [19, 29, 30, 32, 49], as their heavy use of synchronization stresses scalability and their ubiquity facilitates apples-to-apples comparisons among systems.

The PARSEC distribution includes SPLASH-2. We excluded the SPLASH-2 kernels from our graph, along with several other benchmarks that exhibited no noticeable overhead. We dedicated a whole core to the recording agent and thus excluded the 12-core numbers. Logs were written to the null device to eliminate storage performance interactions. SPLASH-2 uses pthread mutexes; on replay, Castor enforces lock ordering that almost eliminates the cases where the pthread mutex code must call into the kernel.

Figure 3: Recording PARSEC/SPLASH-2 on a 12-core Haswell Xeon with a single recording agent—execution time is normalized against the number of threads (benchmarks shown: Water-nsq, Water-spc, FMM, Ocean, Radiosity, Barnes; y-axis: normalized run time, 0.90–1.30; x-axis: cores). Standard deviations were generally below 2%. Only Barnes, Radiosity, and FMM exhibit measurable overheads; many benchmarks executed faster with Castor due to changes in memory layout. For Radiosity, a single recording agent is insufficient to keep up, illustrating the need for multiple recording agents to cope with heavier workloads.

In Figure 3 we see the normalized execution time for Water-nsq (1.00 is ideal). The benchmarks perform identical work regardless of the number of cores; execution time is normalized against the number of threads, and the standard deviations of benchmark times were generally below 2%. In these and other benchmarks, we found that changes in memory layout due to cache effects measurably changed performance, sometimes leading to performance gains, which is why many benchmarks are actually faster with recording enabled. We start seeing recording overhead in FMM and Barnes at 8 cores, but still below 5%.

Only Radiosity showed significant overhead. This was due to two factors: First, this benchmark was particularly sensitive to added cache pressure. Second, during certain phases of the benchmark, large numbers of events are generated on all cores simultaneously, so log bandwidth becomes a limiting factor at 8 cores and beyond. In particular, the single recording agent was unable to keep up with logging and draining the log. Adding an additional recording agent is required to continue to scale.

5.4 PARSEC

We ran several benchmarks from PARSEC versions 2 and 3, where we saw results similar to our fastest SPLASH-2 benchmarks, i.e., zero overhead and small variations due to cache effects. We also ran pbzip2 on 6 cores, with similarly low overhead.

5.5 Replay Performance

Replay times were consistently as fast as recording, and often as fast as or faster than unrecorded execution. There are a number of reasons for this.

First, the replay agent is able to append to the per-thread queues in O(1) time, as it knows the target thread given a log entry. In contrast, the recording agent drains the queues in O(n) time, where n is the number of running threads, as it has to read the head of every thread's queue until it finds the next log entry.

Second, replay can generally read ahead in the global log to feed the individual per-thread queues. This means we can potentially fill all the replay queues, so more replay buffer may be available if an interrupt occurs on the processor threads running the replay agent. On recording, interrupts delivered to the recording agent are more likely to lead to small delays in execution.

Third, replay logs obviate the need to execute recorded operations; thus we can avoid re-executing certain system calls, reducing IO latency in some cases. For example, for fault tolerance we replicate network IO to the destination.
the compiler of variables that are shared between threads or otherwise accessed concurrently. If a variable that does not 5.6 Microbenchmarks employ these mechanisms is operated on concurrently with Per-Event Record Overhead Record overheads were quite no intervening inter-thread synchronization and at least one predictable. We measured roughly 80–200 cycles for each operation is a write, it is a data race. non-deterministic event logged. Overhead scaled roughly Data races are a source of non-determinism, as their effects linearly as a function of the number of events logged. Cache may depend on the relative order of instructions executed effects were the only other significant source of variation. across different CPU cores. Since cross-core instruction ordering may vary from one execution to another, and since Go Our Go implementation with a single core can log 7.3M additionally the order does not depend on any inter-thread events per second while 12 goroutines on our 12-core box synchronization (which Castor would log), Castor does not tops out at 9.0M events per second. These lower numbers are have enough information to replay data races identically. due to limitations of the C compiler used by Go, a lack of Thus, we can only guarantee deterministic replay up to the transactional memory support in our Go implementation, and first data race. Castor always lets us detect that first data race our currently unoptimized integration with the Go scheduler by retroactively applying a happens-before race detector [7]. that requires iterating and dereferencing three structures per Past work has insisted on precisely reproducing all data dequeue. On average, it costs 350 cycles to enqueue an event, races, on the grounds that the non-determinism they introduce with 130 of those cycles spent inside the critical section when can impact reproducing bugs or fault tolerance [29, 30]. logging locks. Our perspective is that the practical impact of data races on record/replay in Castor is largely negligible, for a variety of reasons. Log Size For the SPLASH-2 benchmarks log sizes ranged from 14 kiB to 43 MiB and compression ratios using xz In the past, “benign data races,” data races that were inten- ranged from 23× to 133×. Since our logs tend to have tionally included for optimization reasons, were a recognized so much padding it is not surprising that they are easily practice. Benign data races have always been dangerous and compressible. non-portable practice as their semantics were not well de- fined, thus, the compiler could always interpret them in a manner that would yield bizarre bugs [10]. 5.7 Observations At present, data races are explicitly prohibited and yield Predictable Performance: Our experience with Castor has undefined behavior in languages such as C11, C++11, and Go. been that workloads behave quite predictably. At lower Moreover, data races no longer offer performance benefits event rates cache effects are noticeable, but record overheads on modern architectures [11]. Programs that need scheduler- remain consistently low. Overheads were often under a induced shared-memory non-determinism can achieve it with percent for both server and parallel workloads as shown relaxed-memory-order atomics, which we record and replay. in §5.2 and §5.3. As event rates increase, log throughput Thus, from a language perspective, it makes little sense to becomes the limiting factor, and overheads are a predictable pay a high price for precisely replaying undefined behavior. function of log throughput. 
6. Data Races and their Impact on Replay

Modern languages such as C11, C++11, and Go provide explicit mechanisms, such as the _Atomic qualifier, to notify the compiler of variables that are shared between threads or otherwise accessed concurrently. If a variable that does not employ these mechanisms is operated on concurrently, with no intervening inter-thread synchronization and at least one operation being a write, it is a data race.

Data races are a source of non-determinism, as their effects may depend on the relative order of instructions executed across different CPU cores. Since cross-core instruction ordering may vary from one execution to another, and since additionally that order does not depend on any inter-thread synchronization (which Castor would log), Castor does not have enough information to replay data races identically. Thus, we can only guarantee deterministic replay up to the first data race. Castor always lets us detect that first data race by retroactively applying a happens-before race detector [7].

Past work has insisted on precisely reproducing all data races, on the grounds that the non-determinism they introduce can impact reproducing bugs or fault tolerance [29, 30]. Our perspective is that the practical impact of data races on record/replay in Castor is largely negligible, for a variety of reasons.

In the past, "benign data races," i.e., data races intentionally included for optimization reasons, were a recognized practice. Benign data races have always been a dangerous and non-portable practice, as their semantics were never well defined; the compiler could always interpret them in a manner that would yield bizarre bugs [10].

At present, data races are explicitly prohibited and yield undefined behavior in languages such as C11, C++11, and Go. Moreover, data races no longer offer performance benefits on modern architectures [11]. Programs that need scheduler-induced shared-memory non-determinism can achieve it with relaxed-memory-order atomics, which we record and replay. Thus, from a language perspective, it makes little sense to pay a high price for precisely replaying undefined behavior.

That said, legacy code often uses the volatile qualifier for variables subject to intentional data races. In this case, Castor does precisely reproduce benign data races, as it records and replays volatile variables.

We still want to consider unintentional data races, since programmers make mistakes, and we want to understand their impact. In the following sections we consider how frequently these bugs occur and, when they do, how they affect replay.
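Before turning to frequency, a minimal C11 illustration of the distinction drawn above may help. The example is ours, not from the Castor sources: the plain counter is a data race and hence undefined behavior, while the _Atomic counter with relaxed ordering is well-defined shared-memory non-determinism of the kind Castor records and replays.

    /* C11: a data race (undefined) vs. a relaxed atomic (well-defined).
     * Build with: cc -std=c11 -pthread example.c */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static long racy_hits;             /* concurrent ++ is a data race */
    static _Atomic long atomic_hits;   /* race-free and recordable */

    static void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            racy_hits++;               /* undefined behavior */
            atomic_fetch_add_explicit(&atomic_hits, 1,
                                      memory_order_relaxed);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        /* racy_hits may be anything; atomic_hits is always 200000 */
        printf("racy=%ld atomic=%ld\n", racy_hits,
               atomic_load(&atomic_hits));
        return 0;
    }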
6.1 Data Race Frequency

We found limited data in the existing literature on how frequently data races occur and how effective existing tools are at catching them early.

As a first step toward understanding these issues, we looked at the Chromium web browser bug database. We chose Chromium for a variety of reasons: its large size, prodigious use of multi-threading, and active development all lend themselves to the presence of data races; its use of ThreadSanitizer [42], a state-of-the-art race detector, let us examine the impact of dynamic race detection on catching these bugs; and its large, well annotated bug database simplified research.

We looked at the period from January 1, 2012 to January 1, 2016. During this time the code base grew from roughly 6M LOC to 13M LOC. We considered only bugs that were closed and fixed, to eliminate noise from false positives. While this is a limited data set, it suggests several conclusions.

First, data race bugs were quite rare relative to other bugs: out of 65,861 total bugs fixed during this period, 165 were data races, or about 1 in 400. Next, current development practices and tools seem effective at catching and eliminating data races before they reach production: of the 165 data race bugs, 131 were caught by ThreadSanitizer.

Notably, none of these bugs made it into release builds (development, beta, or stable). Thus, it seems likely that data races would have little effect on record/replay of production code in similar modern development settings. Moreover, data races should have almost no impact on Castor's ability to capture reproducible bug reports in production code in the field.

It seems that software bugs leading to downtime in general would dwarf any impact of data races on fault tolerance. In Chrome, for example, there were 211 bugs in stable release builds that caused crashes, and 10 that caused the application to hang.

6.2 When do data races impact replay?

Data race bugs generally manifest non-deterministically. If a race interferes with replay regularly, it will quickly be detected. If it manifests infrequently, it is unlikely to impede record/replay in its task; e.g., infrequent divergences in fault tolerance are easily handled by resynchronizing (see §3.9).

When races do manifest as non-determinism, they can impact replay in three ways: log divergence (see §3.8), where execution changes sufficiently that the log is not adequate to complete replay; value divergence, where intermediate values of an execution change; and output divergence, where the user-visible result of execution changes, which in Castor is treated as a log divergence (see §3.8).

Log divergences caused by data races seem infrequent. We have only encountered a couple, caused by deliberate (benign) data races in the Go runtime that were easily found and fixed. Others building replay systems reported similar experiences, e.g., a couple of benign data races [23], or races that had already been caught [18].

Two studies classifying data races suggest most races would not cause log divergence. One, studying 68 data races in Windows Vista and Internet Explorer, found that roughly half had no measurable impact on execution [34], i.e., no change in control flow or register live-outs. Of the remaining half, most would not impact replay, while the remainder could possibly lead to log divergence. Similar results were reported in a separate study [25].

The impact of value divergences depends on the use case. For decoupled analysis, such as running a memory error detector or race detector, a value divergence is still a potential program path, and thus little is lost.

For replay debugging, in the worst case we would not be able to reproduce a bug that occurred after the data race. However, we have two options: we can fix the race and redeploy, or employ some strategy to reproduce the bug from the existing log; this could be as simple as re-running a few times with partially ordered replay (see §3.7), or as complex as using speculative execution [8, 29, 37, 49] to find an interleaving that reproduces our bug [18].

For fault tolerance, value divergences are irrelevant (as explained in §3.9), since we only care about what an external observer can see. Output divergences are detected as discussed in §3.8, and dealt with like any other form of log divergence, by resynchronizing with the primary. This approach is similar to what VMware FT [41] and other commercial solutions do to cope with hardware non-determinism.

Resynchronizing introduces a delay and a small downtime window, so having it happen frequently is not desirable. For the reasons discussed above, having this occur often due to data races would be very unusual. Further, for a loss of availability to result, the downtime window would need to overlap with the downtime window of the protected hardware; put another way, replication failure and hardware failure would need to coincide.

Finally, should one wish to fix these bugs when they occur in the field, others have described how data races can be detected and fixed automatically [18].
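For reference, the happens-before check underlying the race detector mentioned in §6 can be sketched with vector clocks. This is a toy rendition of the classic algorithm [7], not Castor's detector: two accesses to the same variable race when neither is ordered before the other and at least one is a write.

    /* Toy vector-clock happens-before check; not Castor's detector. */
    #include <stdbool.h>
    #include <stdio.h>

    #define NTHREADS 2

    struct access {
        int  clock[NTHREADS];   /* thread's vector clock at access time */
        bool is_write;
    };

    /* a is ordered before b iff a's clock is component-wise <= b's
     * (simplified: equal clocks cannot occur for distinct accesses). */
    static bool happens_before(const struct access *a,
                               const struct access *b) {
        for (int i = 0; i < NTHREADS; i++)
            if (a->clock[i] > b->clock[i])
                return false;
        return true;
    }

    static bool races(const struct access *a, const struct access *b) {
        return (a->is_write || b->is_write) &&
               !happens_before(a, b) && !happens_before(b, a);
    }

    int main(void) {
        /* A write and a read with no synchronization between them. */
        struct access w = { .clock = {1, 0}, .is_write = true  };
        struct access r = { .clock = {0, 1}, .is_write = false };
        printf("race: %s\n", races(&w, &r) ? "yes" : "no");   /* yes */
        return 0;
    }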
7. Related Work

Shared memory non-determinism is a key challenge to multi-core record/replay. Systems have coped in two ways: some reproduce all data races precisely; others, like Castor, only record and replay explicit synchronization.

Reproducing All Data Races Systems that reproduce all data races tend to scale poorly as shared memory activity increases. Often this is due to the increased contention these systems induce by adding synchronization.

Multi-core record/replay systems that work at the OS, VMM, or binary translator level need to reproduce all data races, since synchronization information is often lost in the compilation process. For example, on x86 a relaxed-memory-order atomic store is indistinguishable from any other word-aligned store at the ISA level.

VMM [19] and OS-level [26] replay systems have implemented CREW [28] (concurrent read, exclusive write), a cache coherency protocol at page granularity that provides determinism by leveraging the MMU to add extra synchronization. In these systems, overheads from page faults and inter-processor interrupts (IPIs) are high, compounded by false sharing from the large page size. For example, SMP-ReVirt [19] saw slowdowns ranging from 2×–4× for 2 CPUs, and up to 8× for 4 CPUs, on SPLASH-2. For applications with very little sharing these costs can be avoided or amortized [26], and CREW can be more performant in emulated hardware [15]. However, high overheads and poor scalability make CREW generally unsuitable for production workloads.

Several approaches work by simultaneously replaying a copy of the main process while recording, to find a recording that is deterministic [52]. ReSpec detects divergences that occur when logging is insufficient, and is able to roll back the execution and/or restart the replayed process. DoublePlay goes further, allowing offline replay by finding an execution schedule for the threads that can replay correctly on a single core. DoublePlay has average overheads of 15% and 28% for two and four threads, while ReSpec has overheads of 18% and 55%. Both require parallel execution, and thus twice the CPU resources; worst-case SPLASH-2 slowdowns for both exceed 2× for 4 threads.

Other systems attempt to reproduce all data races out of the desire to eliminate non-determinism on replay for debugging or fault tolerance. Chimera ensures synchronization by analyzing program source code with a static race detector and instrumenting it with weak locks. The added contention introduces significant slowdowns on multi-core: on SPLASH-2, Chimera saw overheads ranging from 1.6× for two cores to over 3× for 8 cores. Similar systems exhibited substantially larger overheads [24] or only considered small, easily parallelizable programs [43].
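As a rough, single-process illustration of the page-granularity CREW mechanism described above (real systems coordinate cores with IPIs and track per-page owners; this sketch shows only the fault-and-upgrade step, and its names are ours):

    /* Sketch of CREW-style page protection: a shared page starts
     * read-only; the first write faults, the handler logs the
     * "ownership transfer" and upgrades permissions. POSIX only. */
    #include <signal.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    static char  *page;
    static size_t pagesz;

    static void on_fault(int sig, siginfo_t *si, void *ctx) {
        (void)sig; (void)ctx;
        const char msg[] = "CREW: write fault, granting exclusive access\n";
        write(STDOUT_FILENO, msg, sizeof(msg) - 1);   /* signal-safe */
        mprotect((void *)((uintptr_t)si->si_addr & ~(pagesz - 1)),
                 pagesz, PROT_READ | PROT_WRITE);
    }

    int main(void) {
        pagesz = (size_t)sysconf(_SC_PAGESIZE);
        page = mmap(NULL, pagesz, PROT_READ,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (page == MAP_FAILED)
            return 1;

        struct sigaction sa = {0};
        sa.sa_sigaction = on_fault;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);

        page[0] = 42;   /* faults once; retried after the upgrade */
        printf("page[0] = %d\n", page[0]);
        return 0;
    }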
Recording Explicit Non-Determinism (Synchronization) With source code, shared memory non-determinism can be recorded by instrumenting well-defined interfaces for synchronization, modulo data races.

RecPlay [39] was an early system that recorded synchronization to enable replay debugging and race detection. It intercepted OS-level synchronization through shared library interposition. Record overheads were quite low, around 2% for SPLASH-2 on a 4-core Sun SPARCserver 1000 [13]. Replay overheads could be quite high, ranging from 1.1×–4×; this would preclude its use for fault tolerance and similar applications [16]. With RecPlay, user-level synchronization was manually converted to OS-level synchronization, which could also add significant overhead. Since RecPlay predated the C11 spec, benign data races were still an accepted form of optimization, race detectors were not yet common, etc. Thus, eliminating data races was part of embracing the RecPlay programming model.

R2 [22] recorded non-determinism at different layers of abstraction for improved comprehensibility and performance, e.g., at the MPI or SQL interface, for a broad selection of Win32 applications. Workloads were run on a dual-core Xeon, without parallel benchmarks. Non-determinism was recorded at the library interface using a system of annotations. R2's approach of capturing non-determinism at specific interfaces is probably best viewed as an optimization, complementary to our approach of recording language-level non-determinism in situ as a default.

Arnold [18] explored eidetic computing, continuously recording all applications to make the lineage of any past byte available. Arnold reported SPLASH-2 numbers for up to 8 cores; record overheads were often low, at worst around 20%. It incorporated the ability to use a race detector to dynamically instrument a binary to reproduce a single race. Synchronization was recorded through manual instrumentation. Rex [23] used record/replay to implement Paxos; it also relied on manually instrumenting synchronization.

Castor is largely complementary to this prior work, in that all of these systems could incorporate techniques from Castor for scalability, performance, and transparently instrumenting user-level synchronization.

8. Conclusion

We have presented Castor, a system that provides consistently low-overhead recording and real-time replay for modern multi-core workloads, often with no modification to applications beyond recompilation. This makes it practical to leave recording on by default, to capture bugs in production, or to recover from hardware failures.

Castor offers several lessons for future implementers of record/replay systems. Hardware-optimized logging: log bandwidth is the critical limiting factor for performance in this class of system. Castor's approach to hardware-optimized logging can make an order-of-magnitude difference in log bandwidth, which is directly proportional to the amount of load a record/replay system can handle with low overhead. Recording unmodified applications: by interposing at the compiler level we can instrument user-level non-determinism automatically, eliminating the tedious and error-prone task of manually instrumenting applications for record/replay, and easing adoption. Data races have negligible impact on replay: for a variety of reasons, data races generally have little practical impact on replay. Thus, it is worth considering whether the benefit of trying to reproduce them precisely is worth the cost.

Acknowledgments

Dawson Engler, Phil Levis, and Jim Chow offered helpful comments and discussion. Reviews of drafts by Russ Cox, Austin Clements, Laurynas Riliskis, and Edward Yang were very helpful for improving the content and presentation. Special thanks to Riad Wahby for his editorial help. This work was supported by funding from DARPA, ONR, and Google. And the Gnome.

References

[1] UndoDB Live Recorder: Record/Replay for Live Systems. http://jakob.engbloms.se/archives/2259, Nov. 2015.
[2] Reverse Debugging with ReplayEngine. http://docs.roguewave.com/totalview/8.15.7/pdfs/ReplayEngine_Getting_Started_Guide.pdf, 2015.
[3] Simics 5 is Here. http://blogs.windriver.com/wind_river_blog/2015/06/simics-5-is-here-more-parallel-than-ever.html, June 2015.
[4] Lighttpd: Fly Light. https://www.lighttpd.net/, Jan. 2017.
[5] VMware vSphere: What's New, Availability Enhancements. http://www.slideshare.net/muk_ua/vswn6-m08-avalabilityenhancements, Jan. 2017.
[6] xxHash: Extremely Fast Non-cryptographic Hash Algorithm. http://cyan4973.github.io/xxHash/, Jan. 2017.
[7] S. V. Adve, M. D. Hill, B. P. Miller, and R. H. B. Netzer. Detecting Data Races on Weak Memory Systems. In ISCA '91.
[8] G. Altekar and I. Stoica. ODR: Output-deterministic Replay for Multicore Debugging. In SOSP '09.
[9] C. Bienia. Benchmarking Modern Multiprocessors. PhD thesis, Princeton University, Jan. 2011.
[10] H.-J. Boehm. How to Miscompile Programs with "Benign" Data Races. In HotPar '11.
[11] H.-J. Boehm. Position Paper: Nondeterminism is Unavoidable, but Data Races Are Pure Evil. In RACES '12.
[12] H.-J. Boehm and S. V. Adve. Foundations of the C++ Concurrency Memory Model. In PLDI '08.
[13] K. De Bosschere. Personal communication, May 2016.
[14] T. C. Bressoud and F. B. Schneider. Hypervisor-based Fault Tolerance. In SOSP '95.
[15] Y. Chen and H. Chen. Scalable Deterministic Replay in a Parallel Full-system Emulator. In PPoPP '13.
[16] J. Chow, T. Garfinkel, and P. M. Chen. Decoupling Dynamic Program Analysis from Execution in Virtual Environments. In USENIX ATC '08.
[17] J. Dean and L. A. Barroso. The Tail at Scale. Communications of the ACM, 56:74–80, 2013.
[18] D. Devecsery, M. Chow, X. Dou, J. Flinn, and P. M. Chen. Eidetic Systems. In OSDI '14.
[19] G. W. Dunlap, D. G. Lucchetti, M. A. Fetterman, and P. M. Chen. Execution Replay of Multiprocessor Virtual Machines. In VEE '08.
[20] D. Geels, G. Altekar, S. Shenker, and I. Stoica. Replay Debugging for Distributed Applications. In USENIX ATC '06.
[21] S. Ghemawat and J. Dean. GitHub: google/leveldb.
[22] Z. Guo, X. Wang, J. Tang, X. Liu, Z. Xu, M. Wu, M. F. Kaashoek, and Z. Zhang. R2: An Application-level Kernel for Record and Replay. In OSDI '08.
[23] Z. Guo, C. Hong, M. Yang, D. Zhou, L. Zhou, and L. Zhuang. Rex: Replication at the Speed of Multi-core. In EuroSys '14.
[24] J. Huang, P. Liu, and C. Zhang. LEAP: Lightweight Deterministic Multi-processor Replay of Concurrent Java Programs. In FSE '10.
[25] B. Kasikci, C. Zamfir, and G. Candea. Data Races vs. Data Race Bugs: Telling the Difference with Portend. In ASPLOS '12.
[26] O. Laadan, N. Viennot, and J. Nieh. Transparent, Lightweight Application Execution Replay on Commodity Multiprocessor Operating Systems. SIGMETRICS Performance Evaluation Review, 38(1):155–166, June 2010.
[27] L. Lamport. Time, Clocks, and the Ordering of Events in a Distributed System. Communications of the ACM, 21(7):558–565, July 1978.
[28] T. J. LeBlanc and J. M. Mellor-Crummey. Debugging Parallel Programs with Instant Replay. IEEE Transactions on Computers, 36(4):471–482, Apr. 1987.
[29] D. Lee, B. Wester, K. Veeraraghavan, S. Narayanasamy, P. M. Chen, and J. Flinn. Respec: Efficient Online Multiprocessor Replay via Speculation and External Determinism. In ASPLOS '10.
[30] D. Lee, P. M. Chen, J. Flinn, and S. Narayanasamy. Chimera: Hybrid Program Analysis for Determinism. In PLDI '12.
[31] J. Leverich. Mutilate: High-performance Memcached Load Generator. https://github.com/leverich/mutilate, Jan. 2017.
[32] M. Holt. Caddy: The HTTP/2 Web Server with Automatic HTTPS. https://caddyserver.com/, May 2016.
[33] MemCachier Inc. MemCachier. https://www.memcachier.com/, May 2016.
[34] S. Narayanasamy, Z. Wang, J. Tigani, A. Edwards, and B. Calder. Automatically Classifying Benign and Harmful Data Races Using Replay Analysis. In PLDI '07.
[35] NGINX Inc. NGINX: High Performance Load Balancer, Web Server, & Reverse Proxy. https://www.nginx.com/, Jan. 2017.
[36] R. O'Callahan, C. Jones, N. Froyd, K. Huey, A. Noll, and N. Partush. Lightweight User-Space Record and Replay. CoRR, abs/1610.02144, 2016. http://arxiv.org/abs/1610.02144.
[37] S. Park, Y. Zhou, W. Xiong, Z. Yin, R. Kaushik, K. H. Lee, and S. Lu. PRES: Probabilistic Replay with Execution Sketching on Multiprocessors. In SOSP '09.
[38] PostgreSQL Global Development Group. PostgreSQL: The World's Most Advanced Open Source Database. https://www.postgresql.org/, Jan. 2017.
[39] M. Ronsse and K. De Bosschere. RecPlay: A Fully Integrated Practical Record/Replay System. ACM Transactions on Computer Systems, 17(2):133–152, May 1999.
[40] Y. Saito. Jockey: A User-space Library for Record-replay Debugging. In AADEBUG '05.
[41] D. J. Scales, M. Nelson, and G. Venkitachalam. The Design of a Practical System for Fault-tolerant Virtual Machines. SIGOPS Operating Systems Review, 44(4):30–39, Dec. 2010.
[42] K. Serebryany and T. Iskhodzhanov. ThreadSanitizer: Data Race Detection in Practice. In WBIA '09.
[43] J. Sherry, P. X. Gao, S. Basu, A. Panda, A. Krishnamurthy, C. Maciocco, M. Manesh, J. Martins, S. Ratnasamy, L. Rizzo, and S. Shenker. Rollback-Recovery for Middleboxes. In SIGCOMM '15.
[44] S. M. Srinivasan, S. Kandula, C. R. Andrews, and Y. Zhou. Flashback: A Lightweight Extension for Rollback and Deterministic Replay for Software Debugging. In USENIX ATC '04.
[45] L. Szekeres, M. Payer, T. Wei, and D. Song. SoK: Eternal War in Memory. In IEEE S&P '13.
[46] The Apache Software Foundation. ab: Apache HTTP Server Benchmarking Tool. https://httpd.apache.org/docs/2.4/programs/ab.html, Jan. 2017.
[47] The Golang Team. The Go Memory Model. https://golang.org/ref/mem, May 2014.
[48] The Memcached Team. Memcached: A Distributed Memory Object Caching System. http://memcached.org/, May 2016.
[49] K. Veeraraghavan, D. Lee, B. Wester, J. Ouyang, P. M. Chen, J. Flinn, and S. Narayanasamy. DoublePlay: Parallelizing Sequential Logging and Replay. In ASPLOS '11.
[50] Wikipedia Foundation, Inc. Intel SHA Extensions. https://en.wikipedia.org/wiki/Intel_SHA_extensions, Jan. 2017.
[51] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta. The SPLASH-2 Programs: Characterization and Methodological Considerations. In ISCA '95.
[52] C. Zilles and G. Sohi. Master/Slave Speculative Parallelization. In MICRO '02.