Efficient Scalable Thread-Safety-Violation Detection: Finding thousands of concurrency bugs during testing

Guangpu Li (University of Chicago), Shan Lu (University of Chicago), Madanlal Musuvathi (Microsoft Research), Suman Nath (Microsoft Research), Rohan Padhye (University of California, Berkeley)
Abstract

Concurrency bugs are hard to find, reproduce, and debug. They often escape rigorous in-house testing, but result in large-scale outages in production. Existing concurrency-bug detection techniques unfortunately cannot be part of industry's integrated build and test environment due to some open challenges: how to handle code developed by thousands of engineering teams that uses a wide variety of synchronization mechanisms, how to report little/no false positives, and how to avoid excessive testing resource consumption.

This paper presents TSVD, a thread-safety violation detector that addresses these challenges through a new design point in the domain of active testing. Unlike previous techniques that inject delays randomly or employ expensive synchronization analysis, TSVD uses lightweight monitoring of the calling behaviors of thread-unsafe methods, not any synchronization operations, to dynamically identify bug suspects. It then injects corresponding delays to drive the program towards thread-unsafe behaviors, actively learns from its ability or inability to do so, and persists its learning from one test run to the next. TSVD is deployed and regularly used in Microsoft and it has already found over 1000 thread-safety violations from thousands of projects. It detects more bugs than state-of-the-art techniques, mostly with just one test run.

CCS Concepts • Software and its engineering → Software maintenance tools; Empirical software validation; • Computer systems organization → Reliability;

Keywords thread-safety violation; concurrency bugs; debugging; reliability; scalability

ACM Reference Format:
Guangpu Li, Shan Lu, Madanlal Musuvathi, Suman Nath, and Rohan Padhye. 2019. Efficient Scalable Thread-Safety-Violation Detection: Finding thousands of concurrency bugs during testing. In ACM SIGOPS 27th Symposium on Operating Systems Principles (SOSP '19), October 27–30, 2019, Huntsville, ON, Canada. ACM, New York, NY, USA, 19 pages. https://doi.org/10.1145/3341301.3359638

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
SOSP '19, October 27–30, 2019, Huntsville, ON, Canada
© 2019 Association for Computing Machinery.
ACM ISBN 978-1-4503-6873-5/19/10. . . $15.00
https://doi.org/10.1145/3341301.3359638

Guangpu Li is supported by an SOSP 2019 student scholarship from the National Science Foundation.

1 Introduction

Concurrency bugs are hard to find, reproduce, and debug. Race conditions can hide in rare thread interleavings that do not occur during testing but nevertheless show up in production. Thus, it is not uncommon for even the simplest of concurrency bugs to escape rigorous testing but result in large-scale outages in production. Even when detected, these bugs are hard to reproduce in the small. As such, finding concurrency bugs is an actively studied research area [7, 22, 43, 47, 58] with many open challenges.

This paper specifically deals with a class of concurrency errors we call thread-safety violations, or TSVs. Libraries and classes specify an informal thread-safety contract that determines when a client can and cannot concurrently call into the library/class. A TSV occurs when a client violates this contract. Figure 1 shows an example. The implementation of the Dictionary class allows multiple threads to concurrently call read operations, such as ContainsKey, but requires write operations, such as Add, to only be called in exclusive contexts. By violating this contract, Figure 1 contains a TSV.

    // Dictionary dict
    // Thread 1:
    dict.Add(key1, value)
    // Thread 2:
    dict.ContainsKey(key2)

Figure 1. A thread-safety violation (TSV) bug.

This paper is motivated by the prevalence of such TSVs in production code. In fact, our evaluation discovered that the pattern in Figure 1 is fairly common, as developers erroneously assume that concurrent accesses on different keys are "thread safe." TSVs, while common, can have drastic consequences, including unexpected crashes or, worse, silent data corruption that is difficult to diagnose and fix.

This paper describes TSVD, a dynamic-analysis tool for detecting TSVs. The tool is specifically designed to be used in an integrated build and test environment, such as Bazel [4] or CloudBuild [15]. Here, hundreds to thousands of machines build and run tests on software modules from thousands of engineering groups. We have designed the system end to end to operate at this scale. For instance, it is important to design tools that produce little to no false bug reports. A small loss of productivity per user in chasing false bugs can quickly multiply to huge productivity losses across the organization. Similarly, it is important to design tools that incur small resource overhead. For instance, if a tool incurs a 10× run-time overhead or requires rerunning the same test 10 times, then the test environment needs to allocate 10× more machines to provide the same turnaround time to its users.

Since TSVs generalize the notion of data races to coarser-grain objects, techniques for detecting TSVs are naturally related to data-race detection (Section 2 has more discussion). But existing approaches fall short for our purposes.

Considering the goal of having little to no false bug reports, the approach of active delay injection [14, 47, 58] is promising. This approach directs program threads towards making conflicting accesses by delaying threads at strategic locations at run time. Consequently, this approach only reports bugs that are validated at runtime, and thus are not false bugs.

Considering the goal of having small overhead, however, existing active delay-injection techniques do not work. To better understand their limitations, we can broadly classify them based on the inherent tradeoff between the cost of injecting too many delays versus the sophistication of analysis required to select where to inject delays.

At one extreme, as illustrated in the lower-right corner of Figure 2, tools like RaceFuzzer [58] and CTrigger [47] conduct static or dynamic analysis of memory accesses and synchronization operations in a program to identify potential buggy locations, and selectively inject delays only at these locations. While this reduces the number of delays injected, these techniques bear the cost of performing analysis, either through run-time overhead for dynamic analysis or the poor scalability of precise static analysis required for identifying potentially conflicting accesses.

At the other extreme, as illustrated in the upper-left corner of Figure 2, tools like DataCollider [14] conduct little to no analysis, but instead probabilistically inject delays at many program locations. Such tools require many repeated executions of the same test to find a bug, which we deem unacceptable in our context as described above.

Figure 2. The design space of active testing (axes: number of delay locations vs. cost of identifying delay locations; DataCollider in the upper-left corner, RaceFuzzer and CTrigger in the lower-right, and TSVD in between).

The goal of TSVD is to explore the middle ground in the design space of delay-injection tools, as illustrated in Figure 2. TSVD uses lightweight instrumentation and dynamic analysis to identify potential conflicting accesses without paying the overhead while being effective in finding bugs.

First, TSVD performs lightweight near-miss tracking to identify potential thread-safety violations. It dynamically tracks threads that invoke conflicting methods on the same object close to each other in real time, without monitoring or analyzing synchronization operations.

Second, TSVD uses a happens-before (HB) inferencing technique that leverages the feedback from delay injections to prune call pairs that are likely ordered by the happens-before relation [34] and thus cannot execute concurrently. The key observation is that if there is an HB relationship between two events a and b, then a delay in a will cause a delay in b. By inferring this causality, TSVD avoids injecting delays at these pruned pairs subsequently. Note that, unlike the HB analysis performed by dynamic data-race detectors, HB inference does not require modeling, identifying, and analyzing synchronization operations in a program, which is expensive and complicated given the wide variety of synchronization mechanisms in production software.

An interesting benefit of the near-miss tracking and the HB inferencing is that they only require local instrumentation. That is, only the libraries/classes whose thread-safety contract is checked need to be instrumented. All other parts of the code, including synchronizations such as forks, joins, locks, volatile accesses, and ad-hoc synchronization, do not need to be instrumented. This modularity dramatically reduces the runtime overhead. Furthermore, it enables incremental "pay-as-you-go" thread-safety checking of different classes and libraries, greatly simplifying the deployment of TSVD to thousands of engineering groups, who can configure the tool based on their needs.

Another key benefit of TSVD is that it is independent of the underlying concurrency programming model and oblivious to synchronization semantics. It can thus work across different programming models, such as threads, task-based programming, and asynchronous programming, and be applied to software using a variety of synchronization primitives.

Finally, TSVD compacts its delay injections, delay-location identification, and adjustment all into the same run. In contrast, RaceFuzzer and CTrigger require at least two runs: the first to do the analysis and the second to inject delays. In many cases, as our evaluation shows, TSVD finds a large fraction of bugs in the first test run, making the best use of testing resources. This also allows TSVD to be effective where running many repetitions of the same test is not feasible due to large test setup and shutdown overheads.

We evaluate TSVD on tests from roughly 43,000 software modules developed by various product teams at Microsoft. By design, TSVD produces no false error reports, with each report containing stack traces of the two conflicting operations that violate the thread-safety contract. Overall, TSVD has found over 1,000 previously unknown bugs, involving over 1,000 unique static program locations and over 20K unique call stacks. To validate our results, we drilled into reports from four product teams in Microsoft. These groups confirmed that all the bugs reported are real bugs; 48% have been classified as high-priority bugs that would have otherwise resulted in service outages, and 70% have been fixed so far. Our evaluation also shows that TSVD introduces an acceptable overhead to the testing process: about 33% overhead for multi-threaded test cases, while traditional techniques incur several-times slowdowns.

TSVD is incorporated into the integrated build and test environment at Microsoft. Over the last year, 1,500+ software projects under active development at Microsoft have been tested with TSVD. An open-source version of TSVD is available at https://github.com/microsoft/TSVD.

2 Motivation and Background

2.1 Why Concurrency Testing in the Large?

The broader goal of this paper is to find concurrency errors during testing, ideally "in the small." Our specific goal here is to build a concurrency testing tool as part of the integrated build and test environments that are becoming popular in software companies today [4, 15]. Such an environment provides a centralized code repository for all the different engineering groups. The code is incrementally built on every commit and on specific successful builds; a collection of servers automatically run unit and functional tests. It is not uncommon to have hundreds of servers dedicated to running tests from tens of thousands of modules written by many different groups.

Testing at this scale introduces key challenges: to minimize productivity loss, tools should have little/no false positives; to reduce overheads, the tools should have minimal runtime overhead and not require rerunning the test too many times; and for wide adoption across a variety of product groups, the tool should be "push button," requiring little or no configuration, and should support a variety of programming models. TSVD, discussed in this paper, is one such tool.

2.2 Why Thread-Safety Violations?

A data race occurs when two threads concurrently access the same variable and at least one of these accesses is a write. Thread-safety violations are a generalization of data races to objects and data structures. Each class or library specifies, sometimes implicitly, a thread-safety contract that determines the set of functions that can be called concurrently by threads in the program. For instance, a dictionary implementation might allow two threads to concurrently perform lookups, while another implementation might disallow such calls if lookups can sometimes trigger a "rebalance" operation. In the extreme, a lock-free implementation might concurrently allow any pairs of calls.

A thread-safety violation occurs when the client fails to meet the thread-safety contract of a class or library, as discussed earlier in Figure 1. While the notion of a thread-safety contract can be more general, in this paper we will assume that the methods of a data structure can be grouped into two sets, the read set and the write set, such that two concurrent methods violate the thread-safety contract if and only if at least one of them belongs to the write set. All the data structures we study in this paper have this property.

Thread-safety violations are easy to make but hard to find in code reviews. Sometimes programmers deliberately make such concurrent accesses due to their misunderstanding of the thread-safety contract. In fact, TSVD is particularly inspired by a thread-safety violation simi-
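The read-set/write-set formulation above admits a compact check. The following Python sketch classifies a pair of concurrent calls; the method names and set memberships are our illustrative assumptions loosely modeled on a dictionary, not an official API classification:

```python
# Sketch of the read-set/write-set thread-safety contract from Section 2.2.
# Two concurrent calls violate the contract iff at least one of them
# belongs to the write set. Membership below is illustrative only.
READ_SET = {"ContainsKey", "TryGetValue"}
WRITE_SET = {"Add", "Remove", "Clear"}

def violates_contract(method1, method2):
    """Return True iff concurrently calling method1 and method2
    violates the (assumed) thread-safety contract."""
    return method1 in WRITE_SET or method2 in WRITE_SET
```

Under this rule, the concurrent Add/ContainsKey pair of Figure 1 is a violation, while two concurrent ContainsKey calls are allowed.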
3.2 DynamicRandom

The simplest algorithm takes every TSVD point as an eligible delay location (where), and injects delays at random moments (when). In other words, the call to should_delay returns true with some (small) probability. Once a thread decides to delay, it sleeps for a random amount of time.

3.3 StaticRandom

Previous work that uses delay injections to expose data races, such as DataCollider [14], observed that dynamic sampling is ineffective, as it tends to insert many more delays along hot paths and ignore cold paths where even more bugs may be hidden. To avoid redundantly introducing delays in hot paths and to focus on cold paths, DataCollider samples memory accesses "statically": static program locations are sampled uniformly, irrespective of the number of times a particular location executes. We emulated this algorithm in our StaticRandom variant, where static call sites of relevant data structures' methods are sampled uniformly.

3.4.1 Where to inject delays?

At a high level, TSVD uses lightweight analysis to dynamically maintain a set, called the trap set, of dangerous pairs of program locations that can likely contribute to thread-safety violations, and it injects delays at only these locations. During a testing run, the size of the trap set grows as TSVD discovers new dangerous pairs, or shrinks as TSVD prunes existing dangerous pairs.

TSVD identifies a pair of program locations as a dangerous pair if (1) they correspond to a near-miss TSV (§3.4.2); intuitively, TSVD hopes to turn previous near misses into true conflicts; and (2) at least one of these two locations runs in a concurrent phase with multiple active threads of the program (§3.4.3).

A dangerous pair is pruned from the trap set if (1) TSVD infers a likely happens-before relationship between the two locations (§3.4.4) and hence the pair is unlikely to violate thread safety, or (2) a violation is already found at the pair.

Finally, the delay injection is probabilistic: each program location loc in the trap set is associated with a probability P_loc at which a delay is injected at location loc. TSVD sets P_loc = 1 when a dangerous pair containing loc is added to the trap set, and the probability is decayed with time (§3.4.5). This probabilistic design helps avoid excessive overhead caused by too many delays for dangerous pairs that are actually not bugs.

We now describe the lightweight mechanisms TSVD uses to infer the key components required to make the above decisions. Note that our evaluation section will show a thorough parameter-sensitivity study of all the thresholds/parameters mentioned below.

Figure 6. Happens-before inference in TSVD (the thick arrows indicate the inferred happens-before relationship; thread Thd1 injects a delay d before loc1, lasting from t1start to t1end, while thread Thd2 executes loc0 at t0 and loc2 at t2 with t2 − t0 > δhb × delay_time; the next khb accesses of Thd2 are also inferred to happen after loc1).
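The trap-set bookkeeping just described can be sketched as follows. The class and parameter names (decay_factor, the 0.01 cutoff standing in for "drops to 0") are our own illustrative choices, not TSVD's actual implementation:

```python
import random

class TrapSet:
    """Sketch of TSVD's trap set (Section 3.4.1): dangerous pairs of
    program locations, each location carrying a delay probability P_loc.
    decay_factor and the 0.01 cutoff are illustrative assumptions."""

    def __init__(self, decay_factor=0.5):
        self.pairs = set()   # frozensets of dangerous location pairs
        self.prob = {}       # location -> P_loc
        self.decay_factor = decay_factor

    def add_pair(self, loc1, loc2):
        # A newly added dangerous pair sets P_loc = 1 for both locations.
        self.pairs.add(frozenset((loc1, loc2)))
        self.prob[loc1] = 1.0
        self.prob[loc2] = 1.0

    def should_delay(self, loc, rng=random.random):
        # Inject a delay at loc with probability P_loc.
        return rng() < self.prob.get(loc, 0.0)

    def decay(self, loc):
        # A delay at loc that failed to expose a violation decays P_loc;
        # when P_loc reaches 0, all pairs containing loc are removed.
        p = self.prob.get(loc, 0.0) * self.decay_factor
        if p < 0.01:
            p = 0.0
            self.pairs = {pr for pr in self.pairs if loc not in pr}
        self.prob[loc] = p
```

In the paper's scheme a failed injection decays both locations of the pair; the sketch decays one location per call for brevity.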
3.4.2 Identifying near misses

Intuitively, if a pair of program locations never even gets close to creating a thread-safety violation under the intensive testing environment, they might just never be able to create one due to underlying program semantics and synchronization operations. On the other hand, say an access¹ happens at program location loc1 with the access triple (tid1, obj1, op1) at time t1, and another access happens at program location loc2 with the access triple (tid2, obj2, op2) at time t2. TSVD considers the pair (loc1, loc2) a dangerous pair if tid1 ≠ tid2, obj1 = obj2, at least one of op1 and op2 is a write, and |t1 − t2| ≤ δ for some time threshold δ.

¹For ease of discussion, we do not differentiate between an access to an object and a thread-unsafe method call of an object in this section.

To make this judgment, TSVD maintains a list of accesses per object since δ time ago. For implementation simplicity, rather than storing this state in the object metadata, TSVD maintains a global hash table, indexed by the object's hash code, containing this list of previous accesses. On each access, TSVD consults this list to identify dangerous pairs.

3.4.3 Inferring Concurrent Phases

Synchronization operations like forks, joins, barriers, and locks could lead to sequential execution phases during the execution of a concurrent program, like an initialization phase, a clean-up phase, a join-after-fork phase, and so on, and a TSVD point inside such a sequential phase can never execute in parallel with another TSVD point.

TSVD infers concurrent execution phases by checking the execution history of TSVD points. Specifically, it maintains a global history buffer with a fixed number of most recently executed TSVD points. If the TSVD points in this buffer come from more than one thread, TSVD considers the execution to be inside a concurrent phase.

3.4.4 Inferring likely HB relationship

TSVD's identification of dangerous pairs based on near misses and concurrent phases can potentially be incorrect. For instance, two accesses that are consistently protected by the same lock can happen close to each other. TSVD will fail to convert these near misses into true conflicts due to the dynamic happens-before (HB) relation between the two accesses. Rather than track synchronizations to calculate this HB relation, TSVD uses the lightweight technique below to infer likely HB relationships.

TSVD's dynamic HB inference works based on the following crucial observation. If loc1 happens-before loc2 in an execution, then a delay right before loc1 will cause a proportional delay of loc2. As shown in Figure 6, consider the case when loc1 and loc2 are consistently protected by a lock and loc1 occurs first. When one injects a delay right before loc1, this delay occurs when the thread executing loc1 is still holding the lock. Thus, when the thread executing loc2 tries to acquire this lock (which it needs to do before accessing loc2), it will block. By dynamically tracking this causality between program locations, TSVD infers likely HB relationships between them.

Tracking this causality works as follows. Apart from the per-object history described above, TSVD additionally maintains a history of the most recent access made by every thread. Consider a delay d1 injected right before access loc1 that starts at t1start and ends at t1end. At a subsequent access at loc2 in a different thread Thd2 happening at time t2, we check when the previous access from Thd2 occurred, denoted as t0, and infer a dynamic HB relationship between loc1 and loc2 if (1) there is a long gap between loc2 and that previous access, t2 − t0 ≥ δhb × delay_time, a causal delay threshold, and (2) the long gap overlaps with the injected delay, t0 ≤ t1end. If multiple delays {d at loc} satisfy the inferring condition for Thd2 and loc2, we attribute the long gap in Thd2 to the most recently finished delay d and infer the HB relationship accordingly. Considering the transitivity of the happens-before relationship, the next khb accesses in thread Thd2 are also considered as likely happening after loc1. Again, khb is a tunable parameter and we will evaluate its sensitivity in the evaluation section.

3.4.5 Delay Decaying

Of course, all the inferences above, including the near-miss tracking, the concurrent-phase inference, and the likely-HB-relationship inference, still do not guarantee to prune out all non-TSV pairs. Consequently, for every dangerous pair {p1, p2} in the dangerous list, every delay injection at either p1 or p2 that fails to make p1 and p2 execute in parallel with the same object accessed will cause TSVD to decay the probability of future delay injection at p1 and p2. When the probability of a location loc drops to 0, all its related pairs are removed from the trap set.

3.4.6 When to inject delays?

Different from previous work that separates delay planning and delay injection into different runs [47, 58] and/or uses many runs to gradually inject delays at all planned locations [14, 47, 58], TSVD conducts its delay planning and injections in the same run, making the best use of testing resources.

Planning & injection in the same run. Once TSVD identifies a dangerous pair of program locations (that nearly missed each other), TSVD does not wait until the next run to expose a potential thread-safety violation between them. The rationale is that most instructions execute more than once, and hence most bugs have multiple chances to manifest. This is confirmed by our experimental results in Section 5 showing that TSVD hits the same bug multiple times. Consequently, TSVD can still try to expose a bug after an initial near miss. Specifically, whenever a program location loc is going to be executed, TSVD checks if loc is part of the current trap set. If it is, TSVD will insert a delay at loc with probability P_loc.

Multiple testing runs. The above delay-injection scheme would miss a bug if the near-miss situation observed by TSVD was actually the only chance to expose the bug, i.e., if the dangerous pair of locations never runs together after the pair is identified. When the testing resources allow, TSVD runs the program another time, carrying the knowledge obtained from the first run, to help discover these bugs. During the first run, TSVD records its trap set in a persistent trap file. At the end of the run, the trap file contains the final set of dangerous pairs. At the beginning of the second run, TSVD initializes its trap set from the file, allowing it to inject delays at pairs even at their first occurrences.

Parallel delay injection. There are clear trade-offs between injecting only one delay or multiple delays throughout a run, and between allowing only one thread or multiple threads to be delayed at a time: when multiple delays are injected, they could cancel each other's effect; when few delays are injected, we will need too many runs to test all possible delay situations. TSVD decides to conduct delay injection in an aggressive way: delays are inserted strictly following the dangerous-pair list and the decay-probability scheme, regardless of whether another thread is already blocked. Note that, although this aggressive strategy can cause some injected delays to overlap with each other, the overlaps will only be partial and are unlikely to cancel each other out. This is because different threads will be unblocked at different times, and our decay-probability scheme also helps avoid too many threads being blocked at the same time. On the other hand, an alternative design that strictly avoids delay overlaps would lead to too few delay injections and hence hurt our chance of exposing bugs within the tight testing budget, as we will demonstrate in the evaluation section.

3.5 TSVD with happens-before analysis

To compare TSVD with existing techniques that rely on HB analysis, we designed a variant of TSVD, referred to as TSVDHB, that follows the approach of RaceFuzzer [58]. Note that TSVDHB uses a few optimizations to speed up traditional HB analysis following the type of bugs it targets.

Where to inject delays? Just like TSVD, TSVDHB dynamically maintains a trap set, with every program location appearing in the trap set a delay-injection candidate. At run time, TSVDHB adds a pair of program locations {loc1, loc2} into the trap set if the corresponding accesses (tid1, obj1, op1) and (tid2, obj2, op2) satisfy: tid1 ≠ tid2, obj1 = obj2, op1 and op2 conflict, and, most importantly, they do not have a happens-before relationship with each other.

TSVDHB monitors synchronization operations, such as locks, forks, and joins, and uses vector clocks [18] to compute the happens-before relation between accesses. TSVDHB optimizes traditional race detection to speed up its HB analysis following the observation that, different from traditional race detection, where execution traces contain few synchronization operations but many memory accesses, TSVDHB faces many synchronization operations (the large number of async/await in C# cloud-service modules) yet relatively few accesses (i.e., TSVD points, invocations of thread-unsafe methods of some data structures). The first optimization is to increase local timestamps at accesses (TSVD points), which are relatively few.
Figure 9. Sensitivity analysis of various parameters of TSVD (panels include (e) inference size khb, (f) buffer size, (g) decay factor, and (h) delay time).
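Two of the parameters studied in Figure 9, the causal-delay threshold δhb and the delay time, govern the likely-HB inference rule of Section 3.4.4. A minimal sketch of that rule (function and argument names are ours) checks the two conditions on an injected delay and a later access:

```python
# Sketch of TSVD's likely-HB inference rule (Section 3.4.4).
# A delay was injected before loc1, ending at t1_end. A later access at
# loc2 in another thread happens at t2; that thread's previous access
# was at t0. Infer loc1 happens-before loc2 when:
#   (1) the gap is long:       t2 - t0 >= delta_hb * delay_time
#   (2) it overlaps the delay: t0 <= t1_end
def infers_hb(t0, t2, t1_end, delay_time, delta_hb):
    long_gap = (t2 - t0) >= delta_hb * delay_time
    overlaps_delay = t0 <= t1_end
    return long_gap and overlaps_delay
```

Both conditions are necessary: a long gap that starts after the delay finished cannot have been caused by it, and a gap shorter than δhb × delay_time is not evidence of blocking.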
5.6 Examples of bugs found by TSVD

Device Manager. As shown in Figure 10(a), a device manager uses a Dictionary GlobalStatus to maintain the status of every client, where the client ID is the key and the client status is the value. The device manager has a thread responsible for listening from multiple clients.

Production-Incident Example. This is a bug about two threads trying to sort a list at the same time in a production run. The sorting result of an unprotected list is undetermined when two threads sort it concurrently. This undetermined behavior propagated and finally caused the service to go down for several hours. TSVD can reproduce this bug without any prior knowledge and help the developers reduce the debugging effort.

5.7 TSVD on Open Source Projects

To evaluate whether TSVD can be used beyond Microsoft, we applied TSVD to 9 open-source C# projects (Table 4). Without any existing C# bug benchmark suite, we searched GitHub using the "race condition" keyword and identified these 9 projects that contain (1) confirmed bug reports about TSVs on standard library APIs and (2) developer-written test cases.

Project                    LoC      # tests  # runs  # TSVs  Overhead
ApplicationInsights [3]    67.5K    934      2       1       15.31%
DataTimeExtention [12]     3.2K     169      1       3       18.51%
FluentAssertion [20]       78.3K    3076     1       2       8.89%
K8s-client [33]            332.3K   76       2       1       11.79%
Radical [50]               96.9K    965      1       3       1552.13%
Sequolocity [59]           6.6K     209      1       3       2.97%
Stastd [62]                2.5K     34       2       1       9.72%
System.Linq.Dynamic [63]   1.2K     7        1       1       41.39%
Thunderstruck [64]         1.1K     52       1       2       3.33%

Table 4. TSVD results on open source projects.

Using exactly the same parameters as before, TSVD successfully detects and triggers all the TSVs under test in at most 2 runs. For all but 3 projects, TSVD detects these bugs using the original test cases released with the buggy software; if TSVD had been used, these bugs would have been caught before code release. For Thunderstruck, TSVD detects one TSV that was not part of the original bug report.

We also evaluated the performance of TSVD on all the test cases. The overhead is mostly <20%, consistent with earlier performance results. Two projects incur large overhead, as they have many short-running tests (<1 ms); the average slowdown for their tests is actually less than 400 ms.

6 Related Work

Data Race Detection. Since the tolerance for false error reports is low in our context, we do not focus on static data-race-detection techniques [13, 16, 36, 66]. TSVD is closely related to dynamic active data-race detection techniques [47, 58], as discussed in Section 1 and Figure 2.

Dynamic passive detection techniques [19, 48, 57, 60, 65] perform happens-before analysis, lock-set analysis, or a combination of the two to predict whether a data race could have manifested. Recent work [27, 32, 61] infers more data races by generalizing beyond executions that are happens-before equivalent to the current execution. These techniques pay the runtime cost of monitoring synchronization operations and also suffer from false positives [30, 45, 58]. Several techniques have used sampling to reduce the runtime overhead [5, 31, 42], but like DataCollider [14], they need to repeatedly run the test many times to expose bugs.

Another type of passive detection tool, such as AVIO [38] and Bugaboo [39], catches concurrency bugs when they manifest at run time. Since they do not predict or expose bugs that have not manifested yet, which TSVD does, they do not need to analyze the happens-before relation, but they are also fundamentally unsuitable for the build and test environment TSVD targets.

Race detection has been studied for asynchronous event-driven programs such as Android apps and Web applications [25, 26, 40, 53]. There, the key challenge is to model and analyze the causality between asynchronous requests and responses, which incurs huge run-time slowdowns. TSVD instead automatically infers happens-before relationships. While not the focus of this paper, the techniques behind TSVD can work for applications targeted by these works.

Systematic Testing. Systematic testing techniques discover concurrency errors by driving the program towards different possible interleavings [6, 7, 21, 22, 24, 35, 43]. These techniques are either guided by the need to cover all interleavings within some bound [21, 22, 35, 43] or a coverage notion [6, 24], or they provide a probabilistic guarantee of finding bugs [7]. While these techniques can find general concurrency bugs, they are not designed to discover most bugs in a small number of runs. Instead, TSVD is specifically designed for finding TSVs within the first few runs and without paying the cost of controlling the thread scheduler.

Generating Unit Tests. Tools have been proposed to synthesize unit tests that help expose concurrency bugs inside library functions [9, 49, 54, 55]. Given multi-threaded libraries that are expected to allow safe concurrent invocations, these tools synthesize concurrent method-call sequences to help expose bugs inside the library implementation. They are orthogonal to TSVD: TSVD focuses on improving the efficiency of exposing thread-safety-violation bugs through existing unit tests of software that uses thread-unsafe libraries.

Timing hypothesis. Snorlax [29] reproduced and diagnosed in-production concurrency bugs leveraging a coarse-interleaving hypothesis. Snorlax experiments showed that the time elapsed between events leading to concurrency bugs ranges between 154 and 3505 microseconds, based on which Snorlax could reproduce concurrency bugs without fine-granularity monitoring and recording. This coarse-interleaving hypothesis and TSVD's near-miss tracking look at different aspects of a concurrency bug's timing-window characteristics (Snorlax believes the timing window is not as small as people used to think, and TSVD believes conflicting accesses in small windows are more likely to lead to real bugs) and leverage the characteristics in different ways.

Causality inference. Past work has used run-time trace analysis to infer happens-before relationships among message sending/receiving operations or distributed-system tasks in the context of system-performance analysis [1, 10] and network-dependency analysis [8]. Due to the different usage scenarios, the exact inference algorithms differ between TSVD and these previous tools. Previous tools all require a large number of unperturbed system traces, and statistically infer happens-before relationships based on whether two operations never flip execution order [10] or always execute with a nearly constant physical-time gap in between [8]. Different from previous tools, TSVD works in a unique testing environment where the system trace contains much perturbation introduced by TSVD itself; TSVD also faces the unique goal of finding bugs in a small number of runs, and hence cannot wait until a large number of traces become available. Consequently, TSVD's happens-before inference is uniquely designed based on observing delays in each testing run.

7 Conclusion

This paper presents a new thread-safety-violation detection technique, TSVD. Being part of an integrated build and test environment, TSVD provides push-button, no-false-positive bug detection for .NET programs that use complex multi-threaded and asynchronous programming models with a wide variety of synchronization mechanisms. TSVD provides a starting point for exploring the wide design space of active testing and resource-conscious delay-injection design. Future research can further explore how to balance bug-exposing capability, accuracy, and cost.

Acknowledgments

We would like to thank Lenin Ravindranath Sivalingam for his help in building the first prototype, Suresh Thummalapenta for his help with our experiments, and our shepherd and the anonymous reviewers for their helpful feedback.

References

[2] Aleksandr Mikunov. [n. d.]. Rewrite MSIL Code on the Fly with the .NET Framework Profiling API. https://www.cnblogs.com/WCFGROUP/p/5136703.html.
[3] ApplicationInsights. [n. d.]. Broadcast processor is dropping telemetry due to race condition. https://github.com/Microsoft/ApplicationInsights-dotnet/issues/994.
[4] Bazel. [n. d.]. Bazel: Build and test software of any size, quickly and reliably. https://bazel.build/.
[5] Michael D. Bond, Katherine E. Coons, and Kathryn S. McKinley. 2010. PACER: Proportional Detection of Data Races. In Proceedings of the 2010 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2010), Toronto, Ontario, Canada.
[6] Arkady Bron, Eitan Farchi, Yonit Magid, Yarden Nir, and Shmuel Ur. 2005. Applications of Synchronization Coverage. In Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), Chicago, IL, USA.
[7] Sebastian Burckhardt, Pravesh Kothari, Madanlal Musuvathi, and Santosh Nagarakatte. 2010. A Randomized Scheduler with Probabilistic Guarantees of Finding Bugs. In Proceedings of the Fifteenth Edition of ASPLOS on Architectural Support for Programming Languages and Operating Systems (ASPLOS XV). 167–178.
[8] Xu Chen, Ming Zhang, Zhuoqing Morley Mao, and Paramvir Bahl. 2008. Automating Network Application Dependency Discovery: Experiences, Limitations, and New Solutions. In OSDI, Vol. 8. 117–130.
[9] Ankit Choudhary, Shan Lu, and Michael Pradel. 2017. Efficient Detection of Thread Safety Violations via Coverage-guided Generation of Concurrent Tests. In Proceedings of the 39th International Conference on Software Engineering (ICSE).
[10] Michael Chow, David Meisner, Jason Flinn, Daniel Peek, and Thomas F. Wenisch. 2014. The Mystery Machine: End-to-end Performance Analysis of Large-scale Internet Services. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14). 217–231.
[11] Csharpmm. [n. d.]. Standard ECMA-334 C# Language Specification. http://www.ecma-international.
This work was supported in part by NSF grants org/publications/standards/Ecma-334.htm. funding CNS-1563956 and CNS-1514256. [12] DateTimeExtensions. [n. d.]. Resolve a random race condition. https://github.com/joaomatossilva/ References DateTimeExtensions/pull/86. [1] Marcos K Aguilera, Jeffrey C Mogul, Janet L [13] Dawson R. Engler and Ken Ashcraft. 2003. Rac- Wiener, Patrick Reynolds, and Athicha Muthi- erX: effective, static detection of race conditions and tacharoen. 2003. Performance debugging for dis- deadlocks. In Proceedings of the 19th ACM Sym- tributed systems of black boxes. In ACM SIGOPS posium on Operating Systems Principles , SOSP , Operating Systems Review, Vol. 37. ACM, 74–89. Bolton Landing, NY, USA. [14] John Erickson, Madanlal Musuvathi, Sebastian Bur- [25] Chun-Hung Hsiao, Satish Narayanasamy, Essam ckhardt, and Kirk Olynyk. 2010. Effective Data- Muhammad Idris Khan, Cristiano L. Pereira, and Race Detection for the Kernel.. In OSDI, Vol. 10. Gilles A. Pokam. 2017. AsyncClock: Scalable In- 1–16. ference of Asynchronous Event Causality. In Pro- [15] Hamed Esfahani, Jonas Fietz, Qi Ke, Alexei Kolomi- ceedings of the Twenty-Second International Con- ets, Erica Lan, Erik Mavrinac, Wolfram Schulte, ference on Architectural Support for Programming Newton Sanches, and Srikanth Kandula. 2016. Languages and Operating Systems, ASPLOS, Xi’an, CloudBuild: Microsoft’s Distributed and Caching China. Build Service. In SEIP (seip ed.). IEEE - Institute [26] Chun-Hung Hsiao, Cristiano Pereira, Jie Yu, Gilles of Electrical and Electronics Engineers. https: Pokam, Satish Narayanasamy, Peter M. Chen, Ziyun //www.microsoft.com/en-us/research/publication/ Kong, and Jason Flinn. 2014. Race Detection for cloudbuild-microsofts-distributed-and-caching-build-service/ Event-Driven Mobile Applications. In PLDI. [16] Facebook. [n. d.]. A tool to detect bugs in Java and [27] Jeff Huang, Patrick O’Neil Meredith, and Grigore C/C++/Objective-C code before it ships. https: Rosu. 
2014. Maximal Sound Predictive Race De- //fbinfer.com/. tection with Control Flow Abstraction. In Proceed- [17] Mingdong Feng and Charles E. Leiserson. 1997. Ef- ings of the 35th ACM SIGPLAN Conference on ficient Detection of Determinacy Races in Cilk Pro- Programming Language Design and Implementation grams. In Proceedings of the Ninth Annual ACM (PLDI). Symposium on Parallel Algorithms and Architec- [28] Jb Evain. [n. d.]. Mono.Cecil. https://github.com/ tures (SPAA ’97). ACM, New York, NY, USA, 1–11. jbevain/cecil. https://doi.org/10.1145/258492.258493 [29] Baris Kasikci, Weidong Cui, Xinyang Ge, and Ben [18] Colin J Fidge. 1987. Timestamps in message-passing Niu. 2017. Lazy Diagnosis of In-Production Concur- systems that preserve the partial ordering. (1987). rency Bugs. In Proceedings of the 26th Symposium [19] Cormac Flanagan and Stephen N. Freund. 2009. on Operating Systems Principles (SOSP). FastTrack: Efficient and Precise Dynamic Race De- [30] Baris Kasikci, Cristian Zamfir, and George Candea. tection. In Proceedings of the 30th ACM SIGPLAN 2012. Data races vs. data race bugs: telling the Conference on Programming Language Design and difference with portend. In Proceedings of the 17th Implementation (PLDI). International Conference on Architectural Support [20] FluentAssertion. [n. d.]. Race condition for Programming Languages and Operating Systems, in SelfReferenceEquivalencyAssertionOp- ASPLOS , London, UK. tions.GetEqualityStrategy. https://github.com/ [31] Baris Kasikci, Cristian Zamfir, and George Candea. fluentassertions/fluentassertions/issues/862. 2013. RaceMob: crowdsourced data race detection. [21] Pedro Fonseca, Rodrigo Rodrigues, and Björn B In ACM SIGOPS 24th Symposium on Operating Brandenburg. 2014. SKI: Exposing Kernel Con- Systems Principles, SOSP, Farmington, PA, USA. currency Bugs through Systematic Schedule Explo- [32] Dileep Kini, Umang Mathur, and Mahesh ration. In 11th USENIX Symposium on Operating Viswanathan. 2017. 
Dynamic Race Prediction in Systems Design and Implementation (OSDI 14). 415– Linear Time. In Proceedings of the 38th ACM SIG- 431. PLAN Conference on Programming Language De- [22] Patrice Godefroid. 1997. Model checking for pro- sign and Implementation (PLDI). gramming languages using VeriSoft. In Proceedings [33] kubernetes-client. [n. d.]. fix a race condition. https: of the 24th ACM SIGPLAN-SIGACT symposium //github.com/kubernetes-client/csharp/pull/212. on Principles of programming languages. ACM, 174– [34] Leslie Lamport. 1978. Time, clocks, and the ordering 186. of events in a distributed system. Commun. ACM [23] Patrice Godefroid. 2005. The Soundness of Bugs 21, 7 (1978), 558–565. is What Matters. In PLDI’2005 Workshop on [35] Tanakorn Leesatapornwongsa, Mingzhe Hao, Pallavi the Evaluation of Software Defect Detection Tools Joshi, Jeffrey F Lukman, and Haryadi S Gunawi. (BUGS’2005). 2014. SAMC: Semantic-Aware Model Checking for [24] Shin Hong, Jaemin Ahn, Sangmin Park, Moonzoo Fast Discovery of Deep Bugs in Cloud Systems. In Kim, and Mary Jean Harrold. 2012. Testing Con- 11th USENIX Symposium on Operating Systems current Programs to Achieve High Synchronization Design and Implementation (OSDI 14). 399–414. Coverage. In Proceedings of the 2012 International [36] Bozhen Liu and Jeff Huang. 2018. D4: Fast Con- Symposium on Software Testing and Analysis (IS- currency Debugging with Parallel Differential Anal- STA). ysis. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI). [48] Eli Pozniansky and Assaf Schuster. 2007. MultiRace: [37] Haopeng Liu, Shan Lu, Madan Musuvathi, and Efficient On-the-fly Data Race Detection in Multi- Suman Nath. 2019. What bugs cause production threaded C++ Programs: Research Articles. Con- cloud incidents?. In Proceedings of the Workshop on curr. Comput. : Pract. Exper. 19, 3 (March 2007), Hot Topics in Operating Systems. ACM, 155–162. 327–340. 
https://doi.org/10.1002/cpe.v19:3 [38] Shan Lu, Joseph Tucek, Feng Qin, and Yuanyuan [49] Michael Pradel and Thomas R. Gross. 2012. Fully Zhou. 2006. AVIO: detecting atomicity violations Automatic and Precise Detection of Thread Safety via access interleaving invariants. In Proceedings of Violations. In Proceedings of the 33rd ACM SIG- the 12th International Conference on Architectural PLAN Conference on Programming Language De- Support for Programming Languages and Operating sign and Implementation (PLDI ’12). ACM, New Systems, ASPLOS, San Jose, CA, USA. York, NY, USA, 521–530. https://doi.org/10.1145/ [39] Brandon Lucia and Luis Ceze. 2009. Finding Con- 2254064.2254126 currency Bugs with Context-aware Communica- [50] Radical. [n. d.]. MessageBroker internal subscrip- tion Graphs. In Proceedings of the 42Nd Annual tion(s) list is not thread safe. https://github.com/ IEEE/ACM International Symposium on Microar- RadicalFx/Radical/issues/108. chitecture (MICRO). [51] Raghavan Raman, Jisheng Zhao, Vivek Sarkar, [40] Pallavi Maiya, Aditya Kanade, and Rupak Majum- Martin Vechev, and Eran Yahav. 2010. Efficient dar. 2014. Race Detection for Android Applications. Data Race Detection for Async-finish Parallelism. In PLDI. In Proceedings of the First International Conference [41] J. Manson, W. Pugh, and S. Adve. 2005. The Java on Runtime Verification (RV’10). Springer-Verlag, memory model. In Proceedings of POPL. ACM, 378– Berlin, Heidelberg, 368–383. http://dl.acm.org/ 391. citation.cfm?id=1939399.1939430 [42] Daniel Marino, Madanlal Musuvathi, and Satish [52] Raghavan Raman, Jisheng Zhao, Vivek Sarkar, Mar- Narayanasamy. 2009. LiteRace: effective sampling tin Vechev, and Eran Yahav. 2012. Scalable and for lightweight data-race detection. In Proceedings Precise Dynamic Datarace Detection for Structured of the 2009 ACM SIGPLAN Conference on Pro- Parallelism. 
In Proceedings of the 33rd ACM SIG- gramming Language Design and Implementation, PLAN Conference on Programming Language De- PLDI 2009, Dublin, Ireland. sign and Implementation (PLDI ’12). ACM, New [43] Madan Musuvathi and Shaz Qadeer. 2006. CHESS: York, NY, USA, 531–542. https://doi.org/10.1145/ Systematic stress testing of concurrent software. In 2254064.2254127 International Symposium on Logic-based Program [53] Veselin Raychev, Martin T. Vechev, and Manu Srid- Synthesis and Transformation. Springer, 15–16. haran. 2013. Effective Race Detection for Event- [44] S. Narayanasamy, Z. Wang, J. Tigani, A. Edwards, Driven Programs. In OOPSLA. and B. Calder. 2007. Automatically Classifying Be- [54] Malavika Samak and Murali Krishna Ramanathan. nign and Harmful Data Races Using Replay Analy- 2014. Multithreaded Test Synthesis for Deadlock sis. Detection. In Proceedings of the 2014 ACM Interna- [45] Satish Narayanasamy, Zhenghao Wang, Jordan tional Conference on Object Oriented Programming Tigani, Andrew Edwards, and Brad Calder. 2007. Systems Languages & Applications (OOPSLA ’14). Automatically classifying benign and harmful data [55] Malavika Samak, Murali Krishna Ramanathan, and racesallusing replay analysis. In Proceedings of the Suresh Jagannathan. 2015. Synthesizing Racy Tests. ACM SIGPLAN 2007 Conference on Programming In Proceedings of the 36th ACM SIGPLAN Confer- Language Design and Implementation (PLDI), San ence on Programming Language Design and Imple- Diego, California, USA. mentation (PLDI). [46] Robert O’Callahan and Jong-Deok Choi. 2003. Hy- [56] Anirudh Santhiar and Aditya Kanade. 2017. Static brid Dynamic Data Race Detection. In Proceed- Deadlock Detection for Asynchronous C# Programs. ings of the Ninth ACM SIGPLAN Symposium on In Proceedings of the 38th ACM SIGPLAN Confer- Principles and Practice of Parallel Programming ence on Programming Language Design and Imple- (PPoPP). mentation (PLDI). [47] Soyeon Park, Shan Lu, and Yuanyuan Zhou. 2009. 
[57] Stefan Savage, Michael Burrows, Greg Nelson, CTrigger: exposing atomicity violation bugs from Patrick Sobalvarro, and Thomas Anderson. 1997. their hiding places. In ACM SIGARCH Computer Eraser: A Dynamic Data Race Detector for Multi- Architecture News, Vol. 37. ACM, 25–36. threaded Programs. ACM Trans. Comput. Syst. 15, 4 (Nov. 1997), 391–411. https://doi.org/10.1145/ 265924.265927 [58] Koushik Sen. 2008. Race directed random testing of concurrent programs. ACM Sigplan Notices 43, 6 (2008), 11–21. [59] Sequelocity.NET. [n. d.]. Race condition on TypeCacher. https://github.com/AmbitEnergyLabs/ Sequelocity.NET/pull/23. [60] Konstantin Serebryany and Timur Iskhodzhanov. 2009. ThreadSanitizer: data race detection in prac- tice. In Proceedings of the workshop on binary in- strumentation and applications. ACM, 62–71. [61] Yannis Smaragdakis, Jacob Evans, Caitlin Sadowski, Jaeheon Yi, and Cormac Flanagan. 2012. Sound Predictive Race Detection in Polynomial Time. In Proceedings of the 39th Annual ACM SIGPLAN- SIGACT Symposium on Principles of Programming Languages (POPL). [62] statsd.net. [n. d.]. Race conditions when updat- ing guage value. https://github.com/lukevenediger/ statsd.net/issues/29. [63] System.Linq.Dynamic. [n. d.]. Fix the multi- threading issue at ClassFactory.GetDynamicClass. https://github.com/kahanu/System.Linq.Dynamic/ pull/48. [64] Thunderstruck. [n. d.]. Race condition in Con- nectionStringBuffer singleton. https://github.com/ 19WAS85/Thunderstruck/issues/3. [65] Yuan Yu, Tom Rodeheffer, and Wei Chen. 2005. Racetrack: efficient detection of data race conditions via adaptive tracking. In ACM SIGOPS Operating Systems Review, Vol. 39. ACM, 221–234. [66] Sheng Zhan and Jeff Huang. 2016. ECHO: instan- taneous in situ race detection in the IDE. In Pro- ceedings of the 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2016, Seattle, WA, USA.
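The delay-based happens-before inference discussed in the Causality Inference paragraph of Section 6 can be sketched as follows. This is an illustrative analogue, not TSVD's implementation: TSVD instruments .NET binaries, whereas this sketch uses Python threads, and the function names (`site_a`, `site_b`, `infer_ordered`) and the 30% propagation threshold are assumptions made purely for illustration. The idea is that a delay injected at one call site propagates to another site only if some synchronization (here, a shared lock) orders the two; if the second site is pushed back by a comparable amount, the pair is likely ordered and can be pruned from the suspect list, while an unaffected pair remains a thread-safety-violation suspect.

```python
import threading
import time

def infer_ordered(shared_lock: bool, injected_delay: float = 0.5) -> bool:
    """Inject a delay at call site A; if call site B is pushed back by a
    comparable amount, infer a likely happens-before edge between the sites.
    Illustrative sketch only; the threshold below is arbitrary."""
    lock = threading.Lock()
    b_elapsed = 0.0

    def site_a():
        # Injected delay, held inside A's critical section when synchronized.
        if shared_lock:
            with lock:
                time.sleep(injected_delay)
        else:
            time.sleep(injected_delay)

    def site_b():
        nonlocal b_elapsed
        start = time.monotonic()
        if shared_lock:
            with lock:  # blocks while A holds the lock
                pass
        b_elapsed = time.monotonic() - start

    a = threading.Thread(target=site_a)
    b = threading.Thread(target=site_b)
    a.start()
    time.sleep(0.1)  # let A enter its critical section first
    b.start()
    a.join()
    b.join()

    # Delay propagated => sites are likely synchronized (prune the pair);
    # no propagation => keep the pair as a thread-safety-violation suspect.
    return b_elapsed >= injected_delay * 0.3
```

With a shared lock the injected delay propagates to B and an edge is inferred (`infer_ordered(True)` returns `True`); without synchronization B runs immediately and the pair stays a suspect (`infer_ordered(False)` returns `False`). Unlike the trace-based tools discussed above, this style of inference needs only the runs the tester already perturbs with delays, which is why it suits TSVD's few-runs testing setting.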