
CHESS: A Systematic Testing Tool for Concurrent Software

Madanlal Musuvathi, Shaz Qadeer, and Thomas Ball
Microsoft Research

Technical Report MSR-TR-2007-149

Microsoft Research, Microsoft Corporation, One Microsoft Way, Redmond, WA 98052



Abstract

Concurrency is used pervasively in the development of large systems programs. However, concurrent programming is difficult because of the possibility of unexpected interference among concurrently executing tasks. Such interference often results in "Heisenbugs" that appear rarely and are extremely difficult to reproduce and debug. Stress testing, in which the system is run under heavy load for a long time, is the method commonly employed to flush out such concurrency bugs. This form of testing provides inadequate coverage and has unpredictable results. This paper proposes an alternative called concurrency scenario testing, which relies on systematic and exhaustive testing. We have implemented a tool called CHESS for performing concurrency scenario testing of systems programs. CHESS uses model checking techniques to systematically generate all interleavings of a given scenario. CHESS scales to large concurrent programs and has found numerous previously unknown bugs in systems that had been stress tested for many months prior to being tested by CHESS. For each bug, CHESS is able to consistently reproduce an erroneous execution manifesting the bug, thereby making it significantly easier to debug the problem. CHESS has been integrated into the test frameworks of many code bases inside Microsoft and is being used by testers on a daily basis.

1 Introduction

Given the importance and prevalence of concurrent systems, it is unfortunate that there exists no first-class notion of concurrency testing. What is "concurrency testing"? In theory, concurrency testing is the process of testing a concurrent system for correct behavior under all possible schedules. In practice, people almost always identify concurrency testing with stress testing, which evaluates the behavior of a concurrent system under load. While stress testing does indirectly increase the variety of thread schedules, such testing is far from sufficient. Stress testing does not cover enough different thread schedules and, as a result, yields unpredictable results. A bug may surface one week, when stress testing happens to cover a low-probability schedule, and then disappear for months. Stories are legend of so-called "Heisenbugs" that rarely surface and are hard to reproduce. Still, the mark of reliability of a system remains its ability to run for a long time under heavy load without crashing.

We introduce a new concept called concurrency scenario testing that is applicable to large systems. The system designer thinks of interesting concurrency scenarios. For instance, an interesting scenario for a device driver is concurrently receiving both an I/O message and a shutdown message. These scenarios are system specific and intuitively correspond to unit tests for sequential programs, though we expect concurrent scenarios to be more complex.

Given these scenarios, we use model checking [7, 31] techniques to systematically cover all thread schedules. A model checker essentially captures the nondeterminism of a system and then systematically enumerates all possible choices. For a multithreaded process, this approach is tantamount to running the system under a demonic scheduler. We identify and address three key challenges in making model checking applicable to large shared-memory multithreaded programs.

First, existing model checkers require the programmer to do a huge amount of work just to get started. We call this the "perturbation problem." Traditional methods for model checking are meant to formally verify abstract models and do not work on real code. Other direct-execution methods for checking multithreaded code require significant modifications to the system under test. For instance, CMC [26, 39] requires running the Linux kernel in user-space in order to analyze it, a non-trivial engineering effort. Tools like JPF [37] and ExitBlock [4] force the user to use a specialized JVM. Other recent approaches to model checking systems code, such as MaceMC [19], require writing the program in a new programming language.

The perturbation problem makes model checking a non-starter for testers who want better control over the execution of concurrent programs. Whenever code is changed for the benefit of testing, the tester knows that the real bits are not being tested. Moreover, testers do not have the luxury of changing code. Changing code takes time and is risky. Finally, testers already have extensive and varied test infrastructure in place and are unwilling to "buy into" a different testing tool. Therefore, a concurrency test tool should easily integrate with existing test infrastructure with no modification to the system under test and little modification to the test harness.

Second, concurrency is enabled in most systems via rich and complex concurrency APIs. For instance, the Win32 API [25] used by most user-mode Windows programs contains more than 200 threading and synchronization functions, many with different options and parameters. We wish to wrap the concurrency APIs to capture and control the nondeterminism inherent in the API, without changing the underlying OS scheduler or reimplementing the synchronization primitives of the API. Again, the principle of minimal perturbation is key to making model checking an effective test tool.

Finally, we have the classic problem of state-space explosion. The number of thread interleavings even for small systems can be astronomically large. Scaling model checking to such large systems essentially requires various techniques to focus the systematic enumeration on interesting parts of the state space that are more likely to contain bugs. Recent work [14, 26, 39, 19] has successfully applied model checking to coarse-grained message-passing systems. However, it is not clear how to adapt such techniques for shared-memory multithreaded programs with fine-grained concurrency.

In this paper, we address all these challenges with a tool called CHESS. CHESS enables systematic testing and debugging of three classes of multithreaded programs—user-mode Win32 programs, .NET programs, and Singularity [16] applications. The main contribution of CHESS is that it eliminates the nondeterminism inherent in multithreaded execution, systematically exploring thread schedules in a deterministic manner and, when a bug is discovered, reproducing the exact thread schedule that led to the bug.

To solve the perturbation problem, we first observe that most concurrency tests have been designed to run repeatedly millions of times for the purpose of stress testing. This means that at the end of a concurrency test, all allocated resources are freed and any global state is reset. CHESS leverages this idempotency of concurrency tests in its testing methodology (see Figure 1). Given an idempotent test, CHESS repeatedly executes the test in a loop, exploring a different schedule in each iteration. In particular, CHESS does not need to track any program state, including the initial state, making it relatively easy to attach CHESS to an existing test. The test harness either exposes the TestStartup, RunTestScenario, and TestShutdown functions to CHESS, or the tester introduces appropriate calls to CHESS in an existing test framework.

The only perturbation introduced by CHESS is a thin wrapper layer between the program under test and the concurrency API (see Figure 2). This is required to capture and explore the nondeterminism inherent in the API. We have developed a methodology for writing wrappers that provides enough hooks to CHESS to control the thread scheduling without changing the semantics of the API functions and without modifying the underlying OS, the API implementation, or the system under test. We are also able to map complex synchronization primitives into simpler operations that greatly simplify the process of writing these wrappers. We have validated our methodology by building wrappers for three different platforms—Win32, .NET, and Singularity.

Finally, CHESS uses a variety of techniques, discussed in Section 4, to address the state-explosion problem. The CHESS scheduler is non-preemptive by default, giving it the ability to execute large bodies of code atomically. Of course, a non-preemptive scheduler is at odds with the fact that a real scheduler may preempt a thread at just about any point in its execution. Pragmatically, CHESS explores thread schedules giving priority to schedules with fewer preemptions. The intuition behind this search strategy, called preemption bounding [27], is that many bugs are exposed in multithreaded programs by a few preemptions occurring in particular places in program execution. To scale to large systems, we had to improve upon preemption bounding in several important ways. First, we restrict preemptions to synchronization points (calls to synchronization primitives in the concurrency API) and to those volatile variables that participate in a data race. Second, we give the tester the ability to control the components to which preemptions are added (and, conversely, the components which are treated atomically). This is critical for trusted libraries that are known to be thread-safe.

This paper is the culmination of a three-year effort in applying model checking techniques to comprehensively test concurrent programs. Our testing methodology has been validated by the successful integration of CHESS into the test frameworks of several codebases inside Microsoft (§5). Using CHESS, we have found 25 previously unknown bugs; each bug was found by running an existing stress test, modified to run with a smaller number of threads.

Moreover, CHESS produced a consistently reproducible execution for each bug, greatly simplifying the debugging process. Interestingly, one bug found by CHESS was the root cause of more than 30 previously hard-to-debug Heisenbugs.

More importantly, more than half of the bugs were found by Microsoft testers—people other than the authors of this paper. We emphasize this point because there is a huge difference between the robustness and usability of a tool that researchers apply to a code base of their choice and a tool that has been released "into the wild" to be used daily by testers. CHESS has made this transition successfully. Furthermore, we have demonstrated that CHESS scales to large code bases. It has been applied to Dryad, a distributed execution engine for coarse-grained data-parallel applications [18], and Singularity, a research operating system.

We would like to emphasize that although CHESS is a testing tool dependent on the concurrent test scenario that a tester creates, it has the capability to explore all thread schedules of that test scenario (up to a given preemption bound). Testers who use CHESS appreciate this concept of quantified soundness and, indeed, have themselves pushed us in this direction. In particular, the soundness guarantee is critically important to the use of CHESS in testing a software transactional memory (STM) library. The test harness of this library uses CHESS to compare the behaviors of the STM implementation under different memory consistency models.

To summarize, the main contributions of this paper are the following:

• the first system for integrating model checking into concurrent systems and test frameworks with minimal perturbation;

• techniques for systematic exploration of systems code for fine-grained concurrency with shared memory and multithreading;

• validation of the CHESS tool and its accompanying testing and wrapper methodology on three different platforms;

• a substantial number of previously unknown bugs, even in well-tested systems;

• the ability to consistently reproduce crashing bugs with unknown cause.

2 Using CHESS

In this section, we describe how CHESS is used to systematically test a concurrent program. From the point of view of CHESS, a concurrent program is an isolated code module that exports three functions—TestStartup, RunTestScenario, and TestShutdown. The platform on which such a module is expected to run typically exposes a concurrency API for two purposes—for creating a new task that will execute in parallel with existing tasks, and for creating and accessing primitive objects that will be used for synchronization among the tasks. For example, a user-mode Windows application may use the function CreateThread to create a new thread and the functions InitializeCriticalSection, EnterCriticalSection, and LeaveCriticalSection to respectively create, acquire, and release a lock object. Intuitively, TestStartup is expected to prepare the input for the test by allocating and initializing needed data structures, RunTestScenario creates and executes the tasks participating in the concurrency scenario being tested, and TestShutdown disposes any allocated resources after the test is finished.
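To make this contract concrete, the following is a minimal sketch of such a test module for a two-thread producer/consumer scenario. The scenario, the shared queue, and all names here are our own illustration, not code from CHESS or from any system discussed in this paper.

    // Illustrative only: a tiny Win32 test module exposing the three
    // functions described above.
    #include <windows.h>
    #include <queue>

    static CRITICAL_SECTION g_lock;   // guards g_queue
    static std::queue<int>  g_queue;  // shared state exercised by the scenario

    static DWORD WINAPI Producer(LPVOID) {
        EnterCriticalSection(&g_lock);
        g_queue.push(42);
        LeaveCriticalSection(&g_lock);
        return 0;
    }

    static DWORD WINAPI Consumer(LPVOID) {
        EnterCriticalSection(&g_lock);
        if (!g_queue.empty()) g_queue.pop();
        LeaveCriticalSection(&g_lock);
        return 0;
    }

    void TestStartup() {
        // Prepare the input for the test: allocate and initialize state.
        InitializeCriticalSection(&g_lock);
    }

    void RunTestScenario() {
        // Create the tasks participating in the scenario and wait for them,
        // so that the function returns with the system quiescent.
        HANDLE t[2];
        t[0] = CreateThread(NULL, 0, Producer, NULL, 0, NULL);
        t[1] = CreateThread(NULL, 0, Consumer, NULL, 0, NULL);
        WaitForMultipleObjects(2, t, TRUE, INFINITE);
        CloseHandle(t[0]);
        CloseHandle(t[1]);
        // Reset global state so that the test is idempotent.
        while (!g_queue.empty()) g_queue.pop();
    }

    void TestShutdown() {
        // Dispose any allocated resources after the test is finished.
        DeleteCriticalSection(&g_lock);
    }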

Testing scenarios for real-world concurrent programs invariably have functions that are analogous to TestStartup, RunTestScenario, and TestShutdown. The essence of traditional testing is captured by the following code fragment, whose net effect is to call TestStartup once, followed by repeated calls to RunTestScenario, followed by a final call to TestShutdown.

    TestStartup();
    while(true){
        RunTestScenario();
        if(*) break;
    }
    TestShutdown();

There are two disadvantages of this traditional testing method. First, there is little scheduling variance among the various calls to RunTestScenario, resulting in poor coverage and the possibility of latent bugs that are manifested later, possibly after the software is deployed. Second, even if a bug is revealed during a call to RunTestScenario, the buggy execution might not replay on a subsequent call, making such errors extremely difficult to debug. On the other hand, testing the scenario with CHESS is tantamount to executing the code shown in Figure 1; this is identical to the code executed with traditional testing but for the callbacks Chess.Quiesce and Chess.Done, whose purpose will be explained later in this section. As with traditional testing, each execution of RunTestScenario is a legal interleaving of the tasks in the scenario. In addition, CHESS guarantees that every execution of RunTestScenario generates a new interleaving and that each such interleaving can be replayed.

    TestStartup();
    Chess.Quiesce();
    while(true){
        RunTestScenario();
        Chess.Quiesce();
        if(Chess.Done()) break;
    }
    TestShutdown();

Figure 1: The CHESS testing methodology. CHESS repeatedly executes RunTestScenario, exploring a different schedule in each iteration.

To provide these guarantees, CHESS must control the scheduling of the tasks in the test scenario. One way to achieve this control is to modify the scheduler in the runtime platform of the test module. We avoided this option for two main reasons. First, this option contradicts our philosophical goal of perturbing the system under test as little as possible. Second, the deployment of CHESS in practical testing situations would be severely limited if it required its own modified version of the runtime platform. Instead, CHESS controls the scheduling of tasks by instrumenting all functions in the concurrency API of the platform that create tasks and access synchronization objects (Figure 2). The instrumentation wrappers have to be written once for each platform, e.g. WIN32 or CLR, and are reused for all programs written for that platform. We explain the implementation of these wrappers in Section 3.

The callbacks Chess.Quiesce and Chess.Done are part of the contract between CHESS and the test program that allows CHESS to control the task scheduling using instrumentation wrappers. CHESS requires that every call to RunTestScenario is performed only when the system being tested has reached quiescence. Let us refer to the task executing the code in Figure 1 as the main task. Informally, quiescence means that the computation in the system has ceased temporarily. The computation restarts as soon as the main task calls RunTestScenario. Thus, CHESS takes the test scenario from one quiescent state to another quiescent state via interleavings, each of which is guaranteed to be distinct. The callback Chess.Quiesce asks CHESS to wait until the system under test has reached quiescence. CHESS supports several different modes for reaching quiescence, as described in Section 2.1. In order to determine when to terminate the testing, the callback Chess.Done simply asks CHESS if there are any interleavings that remain to be explored.

CHESS makes two important assumptions about the test program. First, the test program should be isolated from any other computation running on the test platform. For testing a user-mode application on the WIN32 platform, this is easily achieved if the application has no tasks other than those created in TestStartup and RunTestScenario. If such tasks exist, then it is up to the tester to make sure they do not interfere with the tasks in the program being tested. Second, each execution of RunTestScenario must be idempotent, that is, the quiescent states at the beginning and end of RunTestScenario must be behaviorally equivalent. Idempotence of the test guarantees that there is a unique quiescent state of the program. An interleaving of tasks in RunTestScenario then corresponds to a unique non-quiescent state obtained by executing that interleaving from the unique quiescent state. Thus, CHESS is able to systematically generate all interleavings of RunTestScenario without capturing any state of the program, including the quiescent state. Section 4 describes the details of the search algorithm. The idempotence requirement is usually simple to satisfy; it means that RunTestScenario must free all allocated resources and reset any global state accessed by the computation. We have observed that most real-world test programs already satisfy this requirement because they are designed to run repeatedly for a long time.

2.1 Quiescence

The simplest kind of quiescent state is one in which the main task is the only task in the system and it is about to call RunTestScenario. In such a test case, a call to TestStartup simply initializes data structures without creating any tasks, while a call to RunTestScenario creates tasks but waits for all of them to terminate before returning. Detecting such a quiescent state is quite simple.

A more general quiescent state is one in which there are several tasks in the system including the main task. In this state, the main task is about to call RunTestScenario and each of the other tasks is blocked on a synchronization object. In such a test case, a call to TestStartup not only initializes data structures but also creates worker tasks that typically block on some resource waiting for work. A call to RunTestScenario creates a few work items which are processed by the workers until the next quiescent state, in which they again block waiting for work. Such a quiescent state is also simple to detect; it is similar to a deadlock state except that the main task is enabled and about to call RunTestScenario.

The most general quiescent state handled by CHESS is one in which the main task is about to call RunTestScenario and the other tasks are either blocked on a synchronization object or in a livelock.

Figure 2: CHESS architecture. (The original figure shows the test program and its callbacks layered above the CHESS scheduler and the Win32/CLR wrappers, which in turn sit above the Win32 library / CLR and the Windows OS.)

In such a test case, a call to TestStartup, in addition to initializing data structures and creating worker threads, also creates a few background tasks that wake up periodically to perform some activity such as flushing dirty buffers. These background tasks are different from blocking worker tasks because in the quiescent state, they usually sleep for some fixed time rather than blocking on a synchronization object. CHESS detects such a state by keeping track of the elapsed time accumulated by sleep statements. When the total elapsed time reaches a large value, CHESS considers it to be a livelock. Note that this strategy for detecting livelocks works only for fair schedules in which all background threads get the opportunity to make progress. Fortunately, fairness is designed into the CHESS scheduler [29]; more details are provided in Section 4.3.
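The livelock mode of quiescence detection can be pictured with a small sketch. The bookkeeping below, a virtual clock advanced by the Sleep wrapper and a fixed threshold, is our own illustration of the idea, not the actual CHESS implementation.

    // Sketch: quiescence detection in the presence of background tasks.
    #include <cstdint>

    struct QuiescenceDetector {
        uint64_t virtualSleepMs = 0;  // virtual time accumulated by sleeps
        static const uint64_t kLivelockThresholdMs = 10000;  // arbitrary

        // Called from the wrapper of Sleep(ms) while waiting for quiescence:
        // rather than really sleeping, the task advances virtual time.
        void OnSleep(uint32_t ms) { virtualSleepMs += ms; }

        // Quiescent if every task other than the main task is blocked, or if
        // the remaining tasks have done nothing but sleep for a long time.
        bool IsQuiescent(bool allOthersBlocked) const {
            return allOthersBlocked || virtualSleepMs >= kLivelockThresholdMs;
        }

        void Reset() { virtualSleepMs = 0; }  // at each quiescent point
    };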

3 Capturing nondeterminism

Large concurrent programs contain a variety of asynchronously executing entities such as threads, threadpool work items, timers, and asynchronous I/O. To enable systematic search, CHESS creates a simple abstraction for each such entity called a task. CHESS abstracts a program as a nondeterministic state transition system in which each transition is executed by a task. Given a state and a task enabled in it, executing the task results in a unique new state; nondeterminism arises because in each state more than one task may be enabled and any one of them may be scheduled. The initial state of this system is the unique quiescent state described in Section 2. Starting from this initial state, an execution is obtained by iteratively picking an enabled task and executing it for one step. Given the task abstraction and knowledge of the set of tasks enabled in a state, all such executions can be systematically generated in a straightforward fashion.

The goal of the wrappers is to capture sufficient semantics of the concurrency API to provide the abstraction mentioned above to CHESS. In this section, we describe the design of these wrappers for the Win32 API.

Let us first look at the wrapper for a function that creates a new task (Figure 3). The function CreateThread creates a new thread that starts its execution by calling the function pointer f with the argument arg. The wrapper for CreateThread informs the CHESS scheduler that a new task is being forked by calling Chess.Fork. It also creates a closure containing f, arg, and the task identifier returned by Chess.Fork. Then, instead of f and arg, it passes a wrapper function ChessTaskWrapper and the closure to CreateThread. ChessTaskWrapper retrieves the original function and argument from the closure and brackets the call to the function by calls to Chess.TaskBegin and Chess.TaskEnd. Chess.TaskBegin adds the child task to the set of enabled threads. Chess.TaskEnd removes the child task from the set of enabled threads and then schedules another task. Other API functions that create tasks behave similarly by creating closures that bracket appropriate calls to TaskBegin and TaskEnd.

    Wrapper_CreateThread(f, arg){
        tid = Chess.Fork();
        Closure c = <f, arg, tid>;
        return Real_CreateThread(ChessTaskWrapper, c);
    }

    ChessTaskWrapper(Closure c){
        Chess.TaskBegin(c.tid);
        DWORD retVal = c.f(c.arg);
        Chess.TaskEnd(c.tid);
        return retVal;
    }

Figure 3: Task wrappers

We now show the wrappers for typical synchronization functions (Figure 4). The simplest wrappers are for non-blocking synchronization operations such as InterlockedIncrement. The wrapper simply introduces a call to the CHESS scheduler indicating that a synchronization variable is about to be accessed. CHESS treats each access of a synchronization variable as a potential point for introducing a context switch.

The wrappers for EnterCriticalSection and LeaveCriticalSection are more interesting; they illustrate how CHESS solves the problem of keeping track of the set of enabled threads in the presence of potentially blocking operations. To illustrate the problem, note that if the wrapper simply called EnterCriticalSection directly, then the CHESS scheduler would deadlock if the lock is currently held by another thread. Fortunately, the WIN32 API provides a non-blocking version of EnterCriticalSection called TryEnterCriticalSection that acquires the lock if available and returns true, and otherwise returns false without blocking. The wrapper calls TryEnterCriticalSection and then Chess.LocalBacktrack if the lock is not available. This call to Chess.LocalBacktrack is matched with the preceding call to Chess.SyncVarAccess and indicates that the current task is blocked on the lock being accessed. The CHESS scheduler then removes the current task from the set of enabled tasks, adds it to the set of tasks waiting on this lock, and schedules a different task. Later, when the task holding this lock calls LeaveCriticalSection, the call to Chess.SyncVarAccess inside the wrapper moves every task waiting on this lock to the set of enabled tasks. In our experience, concurrency APIs invariably provide a "try" version for every blocking operation, allowing us to treat every other potentially-blocking operation in the same fashion. For example, the wrapper for WaitForSingleObject uses a timeout of 0 milliseconds to create the non-blocking version of the operation.

    Wrapper_InterlockedIncrement(v) {
        Chess.SyncVarAccess(v, RW);
        return InterlockedIncrement(v);
    }

    Wrapper_EnterCriticalSection(cs) {
        while (true) {
            Chess.SyncVarAccess(cs, ACQUIRE);
            if (TryEnterCriticalSection(cs))
                return;
            Chess.LocalBacktrack();
        }
    }

    Wrapper_LeaveCriticalSection(cs) {
        Chess.SyncVarAccess(cs, RELEASE);
        LeaveCriticalSection(cs);
    }

    Wrapper_WaitForSingleObject(handle, timeout) {
        while (true) {
            Chess.SyncVarAccess(handle, ACQUIRE);
            ret = WaitForSingleObject(handle, 0);
            if (!(ret == WAIT_TIMEOUT && timeout == INFINITE))
                return ret;
            Chess.LocalBacktrack();
        }
    }

Figure 4: Synchronization wrappers

The implementation of the task abstraction, while mostly straightforward, has a few subtle cases, one of which we describe here. The CHESS scheduler performs cooperative scheduling of the tasks, maintaining the invariant that only one task in the program is executing at any given moment. When the currently executing task calls Chess.TaskEnd, the CHESS scheduler must not only schedule another task but also let the current task continue to termination. Without proper care, a small window is created where the aforementioned invariant does not hold. This window might create an undesirable race condition if the newly scheduled task attempts to join with the terminating task. The wrapper of the join operation will attempt to join using the "try" version of the operation as described earlier. The attempt will succeed or fail depending on the relative timing of the joining and terminating thread, thereby corrupting the enabled set in the scheduler. To solve this problem, Chess.TaskEnd maintains a variable whose value is either null or the identifier of the terminating task. Whenever a task is woken up by the CHESS scheduler, it first checks this variable and waits for the terminating thread to finish if the value is non-null.
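The handoff just described can be rendered as a small sketch. The variable names and the use of a real thread handle are our own illustration, not the CHESS source.

    // Sketch: closing the window between TaskEnd and actual thread exit.
    #include <windows.h>

    static LONG   g_terminatingTid    = 0;     // 0 means "no task terminating"
    static HANDLE g_terminatingThread = NULL;

    void ScheduleAnotherTask();  // hypothetical scheduler hook

    void OnTaskEnd(LONG tid, HANDLE self) {
        g_terminatingTid    = tid;             // publish the terminating task
        g_terminatingThread = self;
        ScheduleAnotherTask();                 // wake the next scheduled task
        // ... the current thread now runs to termination ...
    }

    void OnTaskWakeup() {
        if (g_terminatingTid != 0) {
            // Wait for the terminating thread to really finish, so that a
            // subsequent "try"-join cannot succeed or fail nondeterministically.
            WaitForSingleObject(g_terminatingThread, INFINITE);
            g_terminatingTid = 0;
        }
        // ... proceed with the scheduled task ...
    }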

A task may also be created when a work item in a threadpool is created using CreateUserWorkItem or a timer is created using CreateTimerQueueTimer. While threadpools are handled in the same way as threads, timers are more complicated for several reasons. First, a timer is expected to start after a time period specified in the call to create the timer. The CHESS scheduler abstracts real time and therefore creates a schedulable task immediately. We feel this is justified because programs written for commodity operating systems usually do not depend for their correctness on real-time guarantees. Second, a timer may be periodic, in which case it must execute repeatedly after every elapse of a time interval. Continuing our strategy of abstracting away real time, we handle periodic timers by converting them into aperiodic timers executing the timer function in a loop. Finally, a timer may be cancelled any time after it has been created. We handle cancellation by introducing a canceled bit per timer. This bit is checked just before the timer function starts executing; the bit is checked once for an aperiodic timer and repeatedly at the beginning of each loop iteration for a periodic timer. If the bit is set, the timer task is terminated.
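A sketch of a timer task that follows this strategy is shown below; the types and names are illustrative, and a real wrapper must also deal with the timer-queue APIs themselves.

    // Sketch: a timer as an immediately schedulable task, with a canceled
    // bit and with periodicity modeled as a loop.
    struct TimerTask {
        void (*fn)(void*);       // the user's timer routine
        void* arg;
        bool  periodic;
        volatile bool canceled;  // set by the wrapper of the cancel API
    };

    void TimerTaskBody(TimerTask* t) {
        // Real time is abstracted away: the task does not wait for the due
        // time and is schedulable as soon as it is created.
        do {
            if (t->canceled)     // checked once for an aperiodic timer,
                break;           // once per iteration for a periodic one
            t->fn(t->arg);
        } while (t->periodic);
        // the timer task terminates here
    }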
In addition to the synchronization operations discussed above, CHESS also handles communication primitives involving FIFO queues such as asynchronous procedure calls (APCs) and IO completion ports. Each WIN32 thread has a queue of APCs associated with it. A thread can enqueue a closure to the queue of another thread by using the function QueueUserAPC. The APCs in the queue of a thread are executed when the thread enters an alertable wait function such as SleepEx. Since the operating system guarantees FIFO execution of the APCs, it suffices in the wrappers to pass on the call to the actual function. The treatment of IO completion ports is similar, again because the completion packets in the queue associated with an IO completion port are delivered in FIFO order.

Asynchronous IO, usually in conjunction with events, APCs, or IO completion ports, also introduces nondeterminism in concurrent programs. To understand the problem, consider the function

    BOOL WINAPI ReadFileEx(
        HANDLE hFile,
        LPVOID lpBuffer,
        DWORD nNumberOfBytesToRead,
        LPOVERLAPPED lpOverlapped,
        LPOVERLAPPED_COMPLETION_ROUTINE routine);

This function reads a file asynchronously and enqueues a closure to the queue of APCs of the calling thread upon completion. This behavior is inherently nondeterministic and dependent upon the relative speed of the program and the IO operation executing in the kernel. To expose this nondeterminism, we create a separate task mapped onto a freshly created thread for each call to ReadFileEx. This task performs the read operation synchronously and, when the read is finished, enqueues the completion routine to the caller of ReadFileEx using the function QueueUserAPC mentioned earlier.

3.1 Hooking the wrappers

Once the wrappers are defined, we use various mechanisms to dynamically intercept calls to the real API functions and forward them to the wrappers. For Win32 programs, we use dll shimming techniques to overwrite the import address table of the program under test. For .NET programs, we use a generalized CLR profiler [8] that replaces calls to API functions with calls to the wrappers at JIT time. Finally, for the Singularity API, we use a static IL rewriter [1] to make the modifications. Table 1 shows the complexity and the amount of effort required to write the wrappers.

    API           No. of wrappers    LOC
    Win32         134                2512
    .NET          64                 1270
    Singularity   37                 876

Table 1: Complexity of writing wrappers for the APIs. The LOC does not count the boiler-plate code that is automatically generated from API specifications.

4 Exploring nondeterminism

The previous section describes how CHESS obtains control at scheduling points before the synchronization operations of the program and how CHESS determines the set of enabled threads at each scheduling point. This section describes how CHESS systematically drives the test along different schedules.

4.1 Basic scheduler operation

To maintain absolute control of the thread interleaving, CHESS allows only one thread to execute at a time, essentially emulating the execution of the test on a uniprocessor. The reason for this design decision is that two or more threads running simultaneously could create data races whose outcome CHESS cannot control. By systematically exploring the different interleavings, CHESS still has the capability of driving the test along any sequentially-consistent execution of the test possible on a multi-processor.
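A standard way to realize this one-thread-at-a-time discipline is a binary semaphore per task, with control handed off explicitly at each scheduling point. The sketch below is our illustration of such a handoff and is not necessarily how CHESS implements it.

    // Sketch: serializing all tasks onto one logical processor.
    #include <windows.h>

    struct Task {
        HANDLE sem;  // signaled exactly when this task is allowed to run
    };

    // Called at a scheduling point by the currently running task.
    void ContextSwitch(Task* current, Task* next) {
        if (current == next) return;
        ReleaseSemaphore(next->sem, 1, NULL);         // let the chosen task run
        WaitForSingleObject(current->sem, INFINITE);  // block until rescheduled
    }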

CHESS repeatedly executes the same test, driving each iteration of the test through a different schedule. In each iteration, the scheduler works in three phases: replay, record, and search.

In the replay phase, the scheduler replays a sequence of scheduling choices from a trace file. This trace file is empty in the first iteration, and contains a partial schedule generated by the search phase of the previous iteration. Once the replay is done, CHESS switches to the record phase. In this phase, the scheduler behaves as a fair, nonpreemptive scheduler. It schedules a thread till the thread yields the processor, either by completing its execution, or blocking on a synchronization operation, or calling one of the yielding operations such as sleep(). On a yield, the scheduler picks the next thread to execute based on priorities that the scheduler maintains to guarantee fairness (§4.3). Also, the scheduler extends the partial schedule in the trace file by recording the thread scheduled at each schedule point together with the set of threads enabled at each point. The latter provides the set of choices that are available but not taken in this test iteration.

When the test terminates, the scheduler switches to the search phase. In this phase, the scheduler uses the enabled information at each schedule point to determine the schedule for the next iteration. Picking the next interesting schedule among the myriad choices available is a challenging problem, and the algorithms in this phase are the most complicated and computationally expensive components of CHESS (§4.4).

The subsequent three subsections describe the key challenges in each of the three phases.
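The three phases fit together roughly as in the following pseudocode; the helper functions are hypothetical stand-ins for the scheduler's internals.

    // Sketch of one CHESS test iteration: replay, then record, then search.
    #include <vector>

    struct Choice { int scheduledTid; std::vector<int> enabledTids; };
    typedef std::vector<Choice> Trace;

    // Hypothetical scheduler hooks.
    void ScheduleThread(int tid);
    bool TestTerminated();
    std::vector<int> EnabledThreads();
    int  PickFairNonPreemptive(const std::vector<int>& enabled);
    Trace NextUnexploredPrefix(const Trace& recorded);

    Trace RunOneIteration(const Trace& fromLastSearch) {
        Trace recorded;
        // Replay phase: force the partial schedule chosen last time.
        for (size_t i = 0; i < fromLastSearch.size(); i++) {
            ScheduleThread(fromLastSearch[i].scheduledTid);
            recorded.push_back(fromLastSearch[i]);
        }
        // Record phase: run as a fair, nonpreemptive scheduler, logging the
        // thread scheduled and the enabled set at every scheduling point.
        while (!TestTerminated()) {
            Choice c;
            c.enabledTids  = EnabledThreads();
            c.scheduledTid = PickFairNonPreemptive(c.enabledTids);
            ScheduleThread(c.scheduledTid);
            recorded.push_back(c);
        }
        // Search phase: pick an enabled-but-untaken choice somewhere in the
        // recorded trace; its prefix becomes the next iteration's schedule.
        return NextUnexploredPrefix(recorded);
    }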
4.2 Dealing with imperfect replay

Unlike stateful model checkers [37, 26] that are able to checkpoint and restore program state, CHESS relies on its ability to replay a test iteration from the beginning to bring a program to a particular state. As has been amply demonstrated in previous work [10, 20, 23, 3], perfect replay is impossible without significant engineering and heavy-weight techniques that capture all sources of nondeterminism. In CHESS, we have made a conscious decision to not rely on a perfect replay capability. Instead, CHESS can robustly handle extraneous nondeterminism in the system, albeit at the cost of the exhaustiveness of the search.

The CHESS scheduler can fail to replay a trace in the following two cases. First, the thread to schedule at a scheduling point is disabled. This happens when a particular resource, such as a lock, was available at this scheduling point in the previous iteration but is currently unavailable. Second, a scheduled thread performs a different sequence of synchronization operations than the one present in the trace. This can happen due to a change in the program control flow resulting from program state that was not reset at the end of the previous iteration.

When the scheduler detects such extraneous nondeterminism, the default strategy is to give up replay and immediately switch to the record phase. This ensures that the current test runs to completion. The scheduler then tries to replay the same trace once again, in the hope that the nondeterminism is transient. On a failure, the scheduler continues the search beyond the current trace. This essentially prunes the search space at the point of nondeterminism. To alleviate this loss of coverage, CHESS has special handling for the most common sources of nondeterminism that we encountered in practice.

Lazy-initialization: Almost all systems we have encountered perform some sort of lazy-initialization, where the program initializes a data structure the first time the structure is accessed. If the initialization performs synchronization operations, CHESS would fail to see these operations in subsequent iterations. To avoid this nondeterminism, CHESS "primes the pump" by running a few iterations of the test as part of the startup, in the hope of initializing all data structures before the systematic exploration. The downside, of course, is that CHESS loses the capability to interleave the lazy-initialization operations with other threads, potentially missing some bugs.

Interference from the environment: The system under test is usually part of a bigger environment that could be concurrently performing computations during a CHESS run. For instance, when we run CHESS on Dryad we bring up the entire Cosmos system (of which Dryad is a part) as part of the startup. While we do expect the tester to provide sufficient isolation between the system under test and its environment, it is impractical to require complete isolation. As a simple example, both Dryad and Cosmos share the same logging module, which uses a lock to protect a shared log buffer. When a Dryad thread calls into the logging module, it could potentially interfere with a Cosmos thread that is currently holding the lock. CHESS handles this as follows. In the replay phase, the interference will result in the current thread being disabled unexpectedly. When this happens, the scheduler simply retries scheduling the current thread a few times before resorting to the default solution mentioned above. If the interference happens in record mode, the scheduler might falsely think that the current thread is disabled when it can actually make progress. In the extreme case, this can result in a false deadlock if no other threads are enabled. To distinguish this from a real deadlock, the CHESS scheduler repeatedly tries scheduling the threads in a deadlock state to ensure that they are indeed unable to make progress.

Nondeterministic calls: The final source of nondeterminism arises from calls to random() and gettimeofday(), which can return different values at different iterations of the test. We expect the tester to avoid making such calls in the test code. However, such calls might still be present in the system under test, which the tester has no control over. We determinize calls to random() by simply reseeding the random number generator to a predefined constant at the beginning of each test iteration. On the other hand, we do not determinize time functions such as gettimeofday(). Most of the calls to time functions do not affect the control flow of the program. Even when they do, it is to periodically refresh some state in the program. In this case, the default strategy of retrying the execution works well in practice.

4.3 Ensuring fair schedules

All reasonable operating system schedulers are fair — they use various queueing policies and priority schemes to ensure that no thread is starved forever. Moreover, most concurrent programs implicitly assume scheduler fairness for correct behavior. For instance, spin-loops are very common in programs. Such loops would not terminate if the scheduler continuously starves the thread that is supposed to set the condition of the loop. Similarly, some threads perform computation till they receive a signal from another thread. An unfair scheduler is not required to eventually schedule the signalling thread.

On such programs, it is essential to restrict systematic enumeration to only fair schedules. Otherwise, a simplistic enumeration strategy will spend a significant amount of time exploring unfair schedules. Moreover, errors found on these interleavings will appear uninteresting to the user, as she would consider these interleavings impossible or unlikely in practice. Finally, fair scheduling is essential for the CHESS testing methodology (§2) for a subtle reason. CHESS relies on the termination of the idempotent test scenario to bring the system to the initial state. Most tests will not terminate on unfair schedules, and with no other checkpointing capability, CHESS will not be able to bring the system to the initial state.

Of course, it is unreasonable to expect CHESS to enumerate all fair schedules. For most programs, there are infinitely many fair interleavings. To see this, note that any schedule that unrolls a spin-loop an arbitrary but finite number of times is still fair. Instead, CHESS makes a pragmatic choice to focus on interleavings that are likely to occur in practice. The fair scheduler, described in detail in a simultaneous publication [29], gives lower priority to threads that yield the processor, either by calling a yielding function such as Thread.yield or by sleeping for a finite time. This immediately restricts the CHESS scheduler to only schedule enabled threads with a higher priority, if any. Under the condition that a thread yields only when it is unable to make progress, this fair scheduler is guaranteed to not miss any safety error [29].

As an interesting side-effect of fair scheduling, CHESS automatically gets the ability to detect liveness violations as well. This is because all liveness properties can be reduced to fair-termination [36]. Essentially, to check if "something good eventually happens", the user writes a test that terminates only when the "good" condition happens. If the program violates this property, the CHESS scheduler will eventually produce a fair schedule under which the test does not terminate. The user identifies such nonterminating behavior by setting an abnormally high bound on the length of the execution.
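For example, to check that a worker thread eventually signals completion, a test can simply block until the signal arrives. This example is our own illustration of the recipe, not a test from the paper's benchmarks.

    // Under fair scheduling, a schedule on which this test never terminates
    // is a liveness violation of "the worker eventually sets the event".
    #include <windows.h>

    static HANDLE g_done;  // manual-reset event set by the worker

    static DWORD WINAPI Worker(LPVOID) {
        // ... do the work being tested ...
        SetEvent(g_done);  // the "good" condition
        return 0;
    }

    void RunTestScenario() {
        g_done = CreateEvent(NULL, TRUE, FALSE, NULL);
        HANDLE t = CreateThread(NULL, 0, Worker, NULL, 0, NULL);
        WaitForSingleObject(g_done, INFINITE);  // terminates only on "good"
        WaitForSingleObject(t, INFINITE);
        CloseHandle(t);
        CloseHandle(g_done);
    }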
4.4 Tackling state-space explosion

State-space explosion is the bane of model checking. Given a program with n threads that execute k atomic steps in total, it is very easy to show that the number of thread interleavings grows astronomically as n^k. The exponential in k is particularly troublesome. It is normal for realistic tests to perform thousands (if not more) synchronization operations in a single run of the test. To be effective in such large state spaces, it is essential to focus on interesting and potentially bug-yielding interleavings. In the previous section, we described how fairness helps CHESS to focus only on fair schedules. We discuss other key strategies below.

4.4.1 Inserting preemptions prudently

In recent work [27], we showed that bounding the number of preemptions is a very good search strategy when systematically enumerating thread schedules. Given a program with n threads that execute k steps in total, the number of interleavings with c preemptions grows as k^c. Informally, this is because, once the scheduler has picked c out of the k possible places to preempt, the scheduler is forced to schedule the resulting chunks of execution atomically. (See [27] for a more formal argument.) On the other hand, we expect that most concurrency bugs happen [24] because of a few preemptions occurring at the right places. In our experience with CHESS, we have been able to reproduce very serious bugs using just 2 preemptions.

On applying CHESS to large systems, however, we found that preemption bounding alone was not sufficient to reasonably reduce the size of the state space. To solve this problem, we had to scope preemptions to code regions of interest, essentially reducing k. First, we realized that a significant portion of the synchronization operations occur in system functions, such as the C run time. Similarly, many of the programs use underlying base libraries which can be safely assumed to be thread-safe. CHESS does not insert preemptions in these modules, thereby gaining scalability at the cost of missing bugs resulting from adverse interactions between these modules and the rest of the system.

Second, we observed that a large number of the synchronizations are due to accesses to volatile variables. Here we borrow a crucial insight from Bruening and Chapin [4] (see also [35]) — if accesses to a particular volatile variable are always ordered by accesses through other synchronization, then it is not necessary to interleave at these points.
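The following contrast, our own example, shows the distinction the insight draws.

    // 'ordered' is always accessed under g_lock, so its accesses are never
    // part of a data race and need not be preemption points; 'racy' is
    // accessed without any ordering synchronization and remains one.
    #include <windows.h>

    static CRITICAL_SECTION g_lock;
    static volatile LONG ordered = 0;  // always accessed while holding g_lock
    static volatile LONG racy    = 0;  // no ordering synchronization

    void OrderedIncrement() {
        EnterCriticalSection(&g_lock);
        ordered = ordered + 1;   // ordered by the lock: no preemption needed
        LeaveCriticalSection(&g_lock);
    }

    void RacyPublish() {
        racy = 1;                // participates in a race: preempt here
    }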

4.4.2 Capturing states

One advantage of stateful model checkers [37, 26] is their ability to cache visited program states. This prevents them from redundantly exploring the same program state more than once, a huge gain in efficiency. The downside is that precisely capturing the state of a large system is onerous [26, 39]. Avoiding this complication was the main reason for designing CHESS to be stateless. However, we obtain some of the advantages of state-caching by observing that we can use the trace used to reach the current state from the initial state as a representation of the current state. Specifically, CHESS maintains for each execution a partially-ordered happens-before graph over the set of synchronization operations in the execution. Two executions that generate the same happens-before graph only differ in the order of independent synchronization operations. Thus, a program that is data-race free will be at the same program state on the two executions. By caching the happens-before graphs of visited states, CHESS avoids exploring the same state redundantly. This reduction has the same effect as a partial-order reduction method called sleep-sets [13] but combines well with preemption bounding [28].
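One simple way to realize such caching is to reduce each execution to a canonical signature of its happens-before graph, for instance by hashing the per-variable access sequences. The encoding below is our illustration of the idea, not CHESS's actual representation.

    // Sketch: two executions that differ only in the order of independent
    // synchronization operations (operations on different variables) get
    // the same signature, so the second can be pruned.
    #include <cstdint>
    #include <map>
    #include <vector>

    struct SyncAccess { uint64_t var; int tid; int kind; };

    uint64_t HappensBeforeSignature(const std::vector<SyncAccess>& run) {
        std::map<uint64_t, uint64_t> perVar;  // order-sensitive hash per variable
        for (size_t i = 0; i < run.size(); i++) {
            uint64_t& h = perVar[run[i].var];
            h = (h * 1099511628211ULL) ^ ((uint64_t)run[i].tid << 8)
                                       ^ (uint64_t)run[i].kind;
        }
        uint64_t sig = 0;                     // order-insensitive combination
        for (std::map<uint64_t, uint64_t>::const_iterator it = perVar.begin();
             it != perVar.end(); ++it)
            sig ^= it->second * (it->first | 1);
        return sig;
    }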
5 Evaluation

In this section, we describe our experience in applying CHESS to several large industry-scale systems.

5.1 Brief description of benchmarks

Table 2 describes the systems on which CHESS has been run. We briefly describe each of these systems to emphasize the range of systems CHESS is applicable to. Also, the integration of CHESS with the first five systems in Table 2 was done by the users of CHESS, with some help from the authors.

                           max       max      max
    Programs     LOC       Threads   Synch.   Preemp.
    PLINQ        23750     8         23930    2
    CDS          6243      3         143      2
    STM          20176     2         75       4
    TPL          24134     8         31200    2
    ConcRT       16494     4         486      3
    CCR          9305      3         226      2
    Dryad        18093     25        4892     2
    Singularity  174601    14        167924   1

Table 2: Characteristics of input programs to CHESS

PLINQ [9] is an implementation of the declarative data-parallel extensions to the .NET framework. CDS (Concurrent Data Structures) is a library that implements efficient concurrent data structures. STM is an implementation of software transactional memory inside Microsoft. TPL and ConcRT are two libraries that provide efficient work-stealing implementations of user-level tasks, the former for .NET programs and the latter for C and C++ programs. CCR is the concurrency and coordination runtime [6], which is part of the Microsoft Robotics Studio Runtime. Dryad is a distributed execution engine for coarse-grained data-parallel applications [18]. Finally, Singularity [16] is a research operating system.

5.2 Test scenarios

In all these programs, except Singularity, we took existing stress tests and modified them to run with CHESS. Most of the stress tests were originally written to create a large number of threads. We modified them to run with fewer threads, for two reasons. First, due to the systematic exploration of CHESS, one no longer needs a large number of threads to create scheduling variety. Second, CHESS scales much better when there are few threads. We validate this reduction in the next section.

Other modifications were required to "undo" code meant to create scheduling variety. We found that testers pepper the code with random calls to yield functions. With CHESS, such tricks are no longer necessary. On the other hand, such calls impede the coverage achieved with CHESS, as the scheduler (§4.3) assumes that a yielding thread is not able to make progress and accordingly assigns it a low scheduling priority. Another common paradigm in the stress tests was to randomly choose between a variety of inputs with probabilities to mimic real executions. In this case we had to refactor the test to concurrently generate the inputs so that CHESS interleaved their processing. These modifications would not be required if system developers and testers were only concerned about creating interesting concurrency scenarios, and relied on a tool like CHESS to create scheduling variety.

Many of the stress tests were idempotent by design, once CHESS was able to handle lazy-initialization of data structures (§4.2).

Common failures of idempotence were bugs in test harnesses, such as handle leaks, that were easy to fix. These bugs are not reported in Table 3. There was one case where a test was not idempotent by design — the tester left the state "corrupted" for subsequent tests. While CHESS could handle this scenario, by running the corrupting test once as part of the startup, we had not tried this at the time of writing this paper.

Finally, we used CHESS to systematically test the entire boot and shutdown sequence of the Singularity operating system [16]. Singularity OS has the ability to run as a user process on top of the Win32 API. This functionality is essentially provided by a thin software layer that emulates the necessary hardware abstractions on the host. This mechanism alone was sufficient to run the entire Singularity OS on CHESS with little modification. The only changes required were to expose certain low-level synchronization primitives at the hardware abstraction layer to CHESS and to call the shutdown immediately after the boot without forking the login process. These changes involved ~300 lines in 4 files.

5.3 Validating CHESS against stress-testing

A common objection to the CHESS methodology is the belief that one cannot find concurrency errors with a small number of threads. This belief is a direct consequence of the painful experience people have of their concurrent systems failing under stress. A central hypothesis of CHESS is that errors in complex systems occur due to complex interleavings of simple scenarios. In this section, we validate this hypothesis. Recent work [24] also suggests a similar hypothesis.

Table 3 shows the bugs that CHESS has found so far on the systems described in Table 2. Table 3 distinguishes between bugs and failures. A bug is an error in the program and is associated with the specific line(s) in the code containing the error. A failure is a (possibly nondeterministic) manifestation of the bug in a test run, and is associated with the test case that fails. Thus, a single bug can cause multiple failures. Also, as is common with concurrency bugs, a known failure does not necessarily imply a known bug — the failure might be too hard to reproduce and debug.

                          Failure / Bug
    Programs     Total    Unk/Unk   Kn/Unk   Kn/Kn
    PLINQ        1                  1
    CDS          1                  1
    STM          2                           2
    TPL          9        9
    ConcRT       4        4
    CCR          2        1         1
    Dryad        7        7
    Singularity  1                  1
    Total        27       21        4        2

Table 3: Bugs found with CHESS, classified on whether the failure and the bug were known or unknown.

Table 3 only reports the number of distinct bugs found by CHESS. The number of failures exceeds this number. As an extreme case, the PLINQ bug is a race condition in a core primitive library that was the root cause of over 30, apparently unrelated, test failures. CHESS found a total of 27 bugs in all of the programs, of which 25 were previously unknown. Of these 25 bugs, 21 did not manifest in existing stress runs over many months. The other 4 bugs were those with known failures but unknown cause. In these cases, the tester pointed us to existing stress tests that failed occasionally but that she was not able to debug. CHESS was able to reproduce the failure, within minutes in some cases, with less than 10 threads and two preemptions. Similarly, CHESS was able to reproduce two failures that the tester had previously (and painfully) debugged on the STM library. So far, CHESS has succeeded in reproducing every stress-test failure reported to us.

5.4 Description of two bugs

In this section, we describe two bugs that CHESS was able to find.

5.4.1 PLINQ bug

CHESS discovered a bug in PLINQ that is due to an incorrect use of LiteEvents, a concurrency primitive implemented in the library. A LiteEvent is an optimization over kernel events that does not require a kernel transition in the common case. It was originally designed to work between exactly two threads — one that calls Set and one that calls Wait followed by a Dispose. Figure 5 contains a simplified version of the code. A lock protects against the subtle race that occurs between a Set that gets preempted right after an update to state and a Dispose. However, the lock does not protect a similar race between a Wait and a Dispose. This is because the designer did not intend LiteEvents to be used with multiple waiters. However, the PLINQ code did not obey this restriction, and CHESS promptly reported the error with just one preemption. As with many concurrency bugs, the fix is easy once the bug is identified.

    LiteEvent::Set(){
        state = SIGNALLED;
        if(kevent)
            lock_and_set_kevent();
    }

    LiteEvent::Wait(){
        if(state == SIGNALLED)
            return;
        alloc_kevent();
        //BUG: kevent can be 0 here
        kevent.wait();
    }

    LiteEvent::Dispose(){
        lock_and_delete_kevent();
    }

Figure 5: A race condition exposed when LiteEvents are used with multiple waiters.
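The single preemption that exposes the bug can be reconstructed from the BUG comment in Figure 5. The trace below is our reading of the failure, with a second waiter preempted between alloc_kevent() and kevent.wait().

    // Our reconstruction of the buggy interleaving (one preemption):
    //
    //   Waiter 2                          Waiter 1 / Disposer
    //   --------                          -------------------
    //   Wait(): state != SIGNALLED
    //   alloc_kevent()
    //        <-- preempted -->
    //                                     Set(): state = SIGNALLED,
    //                                            sets the kernel event
    //                                     Wait() in waiter 1 returns
    //                                     Dispose(): deletes kevent, kevent = 0
    //   kevent.wait()   // kevent is 0 here: the failure CHESS reported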

5.4.2 Singularity bug

CHESS is able to check for liveness properties, and it checks the following property by default [29]: every thread either makes progress towards completing its function or yields the processor. This property detected an incorrect spin loop that never yields in the boot process of Singularity. Figure 6 shows the relevant code snippet.

    Dispatch(id){
        while (true) {
            Platform.Halt();
            // We wake up on any notification.
            // Dispatch only our id
            if ( PendingNotification(id) ) {
                DispatchNotification(id);
                break;
            }
        }
    }

    Platform.Halt(){
        if(AnyNotification())
            return;
        Sleep();
    }

Figure 6: Violation of the good samaritan property. Under certain cases, this loop results in the thread spinning idly till its time-slice expires.

A thread spins in a loop till it receives a particular notification from the underlying hardware. If the notification has a different id and thus does not match what it is waiting for, the thread retries without dispatching the notification. On first sight, it appears that this loop yields by calling the Halt function. However, Halt will return without yielding if any notification is pending. The boot thread thus needlessly spins in the loop till its time-slice expires, starving other threads, potentially including the one that is responsible for receiving the current notification.

The developers responsible for this code immediately recognized this scenario as a serious bug. In practice, this bug resulted in "sluggish I/O behavior" during the boot process, a behavior that was previously known to occur but very hard to debug. This bug was fixed within a week. The fix involved changing the entire notification dispatch mechanism in the hardware abstraction layer.

6 Related work

This paper is concerned with systematic exploration of the behaviors of executable concurrent programs. This basic idea is not new and has previously occurred in the research areas of software testing and model checking. In contrast to this previous work, this paper is the first to demonstrate the applicability of such systematic exploration to large systems, with little perturbation to the program, the runtime, and the test infrastructure.

Carver and Tai [5] proposed repeatable deterministic testing by running the program with an input and an explicit thread schedule. The idea of systematic generation of thread schedules came later under the rubric of reachability testing [17]. Recent work in this area includes RichTest [22], which performs efficient search using on-the-fly partial-order reduction techniques, and ExitBlock [4], which observes that context switches only need to be introduced at synchronization accesses, an idea we borrow.

In the model checking community, the idea of applying state exploration directly to executing concurrent programs can be traced back to the Verisoft model checker [15], which is similar to the testing approach in that it enumerates thread schedules rather than states. There are a number of other model checkers, such as Java PathFinder [37], Bogor [32], CMC [26], and MaceMC [19], that attempt to capture and cache the visited states of the program. CHESS is designed to be stateless; hence it is similar in spirit to the work on systematic testing and stateless model checking. What sets CHESS apart from the previous work in this area is its focus on detecting both safety and liveness violations on large multithreaded systems programs. Effective safety and liveness testing of such programs requires novel techniques—quiescence detection, preemption bounding, and fair scheduling—absent from the previous work.

13 tematic generation of all executions. In contrast, CHESS [14] GODEFROID, P. Model checking for programming languages us- obtains greater control over thread scheduling to offer ing Verisoft. In POPL 97: Principles of Programming Languages higher coverage guarantees and better reproducibility. (1997), ACM Press, pp. 174–186. The ability to replay a concurrent execution is a fun- [15] GODEFROID, P. Model checking for programming languages us- ing Verisoft. In POPL 97: Principles of Programming Languages damental building block for CHESS. The problem of de- (1997), ACM Press, pp. 174–186. terministic replay has been well-studied [21, 33, 38, 10, [16] HUNT,G.C.,AIKEN,M.,FHNDRICH,M.,HODSON,C.H.O., 2, 20, 11]. The goal of CHESS is to not only capture LARUS,J.R.,LEVI,S.,STEENSGAARD,B.,TARDITI,., the nondeterminism but also to systematically explore it. AND WOBBER, T. Sealing OS processes to improve dependabil- Also, to avoid the inherent cost of deterministic replay, ity and safety. In Proceedings of the EuroSys Conference (2007), pp. 341–354. we have designed CHESS to robustly handle some non- determinism at the cost of test coverage. [17] HWANG,G.,TAI,K., AND HUNAG, T. Reachability testing: An approach to testing concurrent software. International Journal of The work on dynamic data-race detection, e.g., [34, Software Engineering and Knowledge Engineering 5, 4 (1995), 30, 40], is orthogonal and complementary to the work 493–510. on systematic enumeration of concurrent behaviors. A [18] ISARD,M.,BUDIU,M.,YU, Y., BIRRELL,A., AND FET- tool like CHESS can be used to systematically generate TERLY, D. Dryad: distributed data-parallel programs from se- dynamic executions, each of which can then be analyzed quential building blocks. In Proceedings of the EuroSys Confer- ence (2007), pp. 59–72. by a data-race detector. [19] KILLIAN,C.E.,ANDERSON, J. W., JHALA,R., AND VAHDAT, A. Life, death, and the critical transition: Finding liveness bugs References in systems code. In NSDI 07: Networked Systems Design and Implementation (2007), pp. 243–256. [1] Abstract IL — http://research.microsoft.com/projects/ilx/absil.aspx. [20] KONURU,R.B.,SRINIVASAN,H., AND CHOI, J.-D. Determin- [2] BHANSALI,S.,CHEN, W.-K., DE JONG,S.,EDWARDS,A., istic replay of distributed java applications. In IPDPS 00: Inter- MURRAY,R.,DRINIC´ ,M.,MIHOCKAˇ ,D., AND CHAU,J. national Parallel and Distributed Processing Symposium (2000), Framework for instruction-level tracing and analysis of program pp. 219–228. executions. In VEE 06: International Conference on Virtual Ex- [21] LEBLANC, T. J., AND MELLOR-CRUMMEY, J. M. Debugging ecution Environments (2006), ACM, pp. 154–163. parallel programs with instant replay. IEEE Trans. Comput. 36, 4 [3] BOOTHE, B. Efficient algorithms for bidirectional debugging. In (1987), 471–482. PLDI 00: Programming Language Design and Implementation [22] LEI, Y., AND CARVER, R. H. Reachability testing of concurrent (2000), pp. 299–310. programs. IEEE Trans. Software Eng. 32, 6 (2006), 382–403. [4] BRUENING,D., AND CHAPIN, J. Systematic testing of mul- [23] LIU,X.,LIN, W., PAN,A., AND ZHANG, Z. Wids checker: tithreaded Java programs. Tech. Rep. LCS-TM-607, MIT/LCS, Combating bugs in distributed systems. In NSDI 07: Networked 2000. Systems Design and Implementation (2007). [5] CARVER,R.H., AND TAI, K.-C. Replay and testing for concur- [24] LU,S.,PARK,S.,SEO,E., AND ZHOU, Y. Learning from rent programs. IEEE Softw. 8, 2 (1991), 66–74. mistakes: a comprehensive study on real world concurrency bug characteristics. 
References

[1] Abstract IL — http://research.microsoft.com/projects/ilx/absil.aspx.

[2] BHANSALI, S., CHEN, W.-K., DE JONG, S., EDWARDS, A., MURRAY, R., DRINIĆ, M., MIHOČKA, D., AND CHAU, J. Framework for instruction-level tracing and analysis of program executions. In VEE 06: International Conference on Virtual Execution Environments (2006), ACM, pp. 154–163.

[3] BOOTHE, B. Efficient algorithms for bidirectional debugging. In PLDI 00: Programming Language Design and Implementation (2000), pp. 299–310.

[4] BRUENING, D., AND CHAPIN, J. Systematic testing of multithreaded Java programs. Tech. Rep. LCS-TM-607, MIT/LCS, 2000.

[5] CARVER, R. H., AND TAI, K.-C. Replay and testing for concurrent programs. IEEE Softw. 8, 2 (1991), 66–74.

[6] Concurrency and Coordination Runtime — http://msdn.microsoft.com/en-us/library/bb648752.aspx.

[7] CLARKE, E., AND EMERSON, E. Synthesis of synchronization skeletons for branching time temporal logic. In Logic of Programs (1981), LNCS 131, Springer-Verlag, pp. 52–71.

[8] The CLR profiler — http://msdn.microsoft.com/en-us/library/ms979205.aspx.

[9] DUFFY, J. A query language for data parallel programming: invited talk. In DAMP (2007), p. 50.

[10] DUNLAP, G. W., KING, S. T., CINAR, S., BASRAI, M. A., AND CHEN, P. M. ReVirt: Enabling intrusion analysis through virtual-machine logging and replay. In OSDI 02: Operating Systems Design and Implementation (2002).

[11] DUNLAP, G. W., LUCCHETTI, D. G., FETTERMAN, M. A., AND CHEN, P. M. Execution replay of multiprocessor virtual machines. In VEE 08: International Conference on Virtual Execution Environments (2008), ACM, pp. 121–130.

[12] EDELSTEIN, O., FARCHI, E., GOLDIN, E., NIR, Y., RATSABY, G., AND UR, S. Framework for testing multi-threaded Java programs. Concurrency and Computation: Practice and Experience 15, 3–5 (2003), 485–499.

[13] GODEFROID, P. Partial-Order Methods for the Verification of Concurrent Systems: An Approach to the State-Explosion Problem. LNCS 1032. Springer-Verlag, 1996.

[14] GODEFROID, P. Model checking for programming languages using VeriSoft. In POPL 97: Principles of Programming Languages (1997), ACM Press, pp. 174–186.

[15] GODEFROID, P. Model checking for programming languages using VeriSoft. In POPL 97: Principles of Programming Languages (1997), ACM Press, pp. 174–186.

[16] HUNT, G. C., AIKEN, M., FÄHNDRICH, M., HAWBLITZEL, C., HODSON, O., LARUS, J. R., LEVI, S., STEENSGAARD, B., TARDITI, D., AND WOBBER, T. Sealing OS processes to improve dependability and safety. In Proceedings of the EuroSys Conference (2007), pp. 341–354.

[17] HWANG, G., TAI, K., AND HUANG, T. Reachability testing: An approach to testing concurrent software. International Journal of Software Engineering and Knowledge Engineering 5, 4 (1995), 493–510.

[18] ISARD, M., BUDIU, M., YU, Y., BIRRELL, A., AND FETTERLY, D. Dryad: Distributed data-parallel programs from sequential building blocks. In Proceedings of the EuroSys Conference (2007), pp. 59–72.

[19] KILLIAN, C. E., ANDERSON, J. W., JHALA, R., AND VAHDAT, A. Life, death, and the critical transition: Finding liveness bugs in systems code. In NSDI 07: Networked Systems Design and Implementation (2007), pp. 243–256.

[20] KONURU, R. B., SRINIVASAN, H., AND CHOI, J.-D. Deterministic replay of distributed Java applications. In IPDPS 00: International Parallel and Distributed Processing Symposium (2000), pp. 219–228.

[21] LEBLANC, T. J., AND MELLOR-CRUMMEY, J. M. Debugging parallel programs with instant replay. IEEE Trans. Comput. 36, 4 (1987), 471–482.

[22] LEI, Y., AND CARVER, R. H. Reachability testing of concurrent programs. IEEE Trans. Software Eng. 32, 6 (2006), 382–403.

[23] LIU, X., LIN, W., PAN, A., AND ZHANG, Z. WiDS Checker: Combating bugs in distributed systems. In NSDI 07: Networked Systems Design and Implementation (2007).

[24] LU, S., PARK, S., SEO, E., AND ZHOU, Y. Learning from mistakes: A comprehensive study on real world concurrency bug characteristics. In ASPLOS 08: Architectural Support for Programming Languages and Operating Systems (2008).

[25] Windows API reference — http://msdn.microsoft.com/en-us/library/aa383749(vs.85).aspx.

[26] MUSUVATHI, M., PARK, D., CHOU, A., ENGLER, D., AND DILL, D. L. CMC: A pragmatic approach to model checking real code. In OSDI 02: Operating Systems Design and Implementation (2002), pp. 75–88.

[27] MUSUVATHI, M., AND QADEER, S. Iterative context bounding for systematic testing of multithreaded programs. In PLDI 07: Programming Language Design and Implementation (2007), pp. 446–455.

[28] MUSUVATHI, M., AND QADEER, S. Partial-order reduction for context-bounded state exploration. Tech. Rep. MSR-TR-2007-12, Microsoft Research, 2007.

[29] MUSUVATHI, M., AND QADEER, S. Fair stateless model checking. In PLDI 08: Programming Language Design and Implementation (2008).

[30] O'CALLAHAN, R., AND CHOI, J.-D. Hybrid dynamic data race detection. In PPOPP 03: Principles and Practice of Parallel Programming (2003), pp. 167–178.

[31] QUEILLE, J., AND SIFAKIS, J. Specification and verification of concurrent systems in CESAR. In Fifth International Symposium on Programming (1981), LNCS 137, Springer-Verlag, pp. 337–351.

[32] ROBBY, DWYER, M., AND HATCLIFF, J. Bogor: An extensible and highly-modular model checking framework. In FSE 03: Foundations of Software Engineering (2003), ACM, pp. 267–276.

[33] RUSSINOVICH, M., AND COGSWELL, B. Replay for concurrent non-deterministic shared-memory applications. In PLDI 96: Programming Language Design and Implementation (1996), pp. 258–266.

[34] SAVAGE, S., BURROWS, M., NELSON, G., SOBALVARRO, P., AND ANDERSON, T. Eraser: A dynamic data race detector for multithreaded programs. ACM Transactions on Computer Systems 15, 4 (1997), 391–411.

[35] STOLLER, S. D., AND COHEN, E. Optimistic synchronization-based state-space reduction. In TACAS 03 (2003), LNCS 2619, Springer-Verlag, pp. 489–504.

[36] VARDI, M. Y. Verification of concurrent programs: The automata-theoretic framework. Annals of Pure and Applied Logic 51, 1–2 (1991), 79–98.

[37] VISSER, W., HAVELUND, K., BRAT, G., AND PARK, S. Model checking programs. In ASE 00: Automated Software Engineering (2000), pp. 3–12.

[38] XU, M., BODIK, R., AND HILL, M. D. A "flight data recorder" for enabling full-system multiprocessor deterministic replay. SIGARCH Comput. Archit. News 31, 2 (2003), 122–135.

[39] YANG, J., TWOHEY, P., ENGLER, D. R., AND MUSUVATHI, M. Using model checking to find serious file system errors. ACM Transactions on Computer Systems 24, 4 (2006), 393–423.

[40] YU, Y., RODEHEFFER, T., AND CHEN, W. RaceTrack: Efficient detection of data race conditions via adaptive tracking. In SOSP 05: Symposium on Operating Systems Principles (2005), pp. 221–234.

Notes

1. In this paper, we restrict our attention to single-process programs consisting of multiple threads.

2. This is even true for programs written in languages such as Java and C#, which simply provide syntactic sugar over an existing concurrency API.

3. The set of unfair schedules far outnumbers the set of fair schedules for reasonable programs. Theoretically, for a program that terminates on all fair schedules, the set of fair schedules is countably infinite while the set of unfair schedules is uncountably infinite: every fair schedule of such a program is a finite sequence of thread choices, and there are only countably many finite sequences, whereas unfair schedules are necessarily infinite, and there are uncountably many infinite sequences.

4. The authors would like to thank Chris Dern, Rahul Patil, Susan Wo, Raghu Simha, Pooja Nagpal, and Roy Tan for being immensely patient first users of CHESS.

5. The authors would like to thank Chris Hawblitzel for these changes.
