ARCHER: Effectively Spotting Data Races in Large OpenMP Applications

Simone Atzeni, Ganesh Gopalakrishnan, Zvonimir Rakamarić
University of Utah, Salt Lake City, UT, United States
{simone, ganesh, zvonimir}@cs.utah.edu

Dong H. Ahn, Ignacio Laguna, Martin Schulz, Gregory L. Lee
Lawrence Livermore National Laboratory, Livermore, CA, United States
{ahn1, lagunaperalt1, schulzm, lee218}@llnl.gov

Joachim Protze, Matthias S. Müller
RWTH Aachen University, Aachen, Germany
{protze, mueller}@itc.rwth-aachen.de

Abstract—OpenMP plays a growing role as a portable programming model to harness on-node parallelism; yet, existing data race checkers for OpenMP have high overheads and generate many false positives. In this paper, we propose the first OpenMP data race checker, ARCHER, that achieves high accuracy, low overheads on large applications, and portability. ARCHER incorporates scalable happens-before tracking, exploits structured parallelism via combined static and dynamic analysis, and modularly interfaces with OpenMP runtimes. ARCHER significantly outperforms TSan and Intel Inspector XE, while providing the same or better precision. It has helped detect critical data races in the Hypre library that is central to many projects at Lawrence Livermore National Laboratory and elsewhere.

Keywords—data race detection; OpenMP; high performance computing; static analysis; dynamic analysis

I. INTRODUCTION

High performance computing (HPC) is undergoing an explosion in raw computing capabilities as evidenced by recent announcements of next-generation computing system projects [1]–[3]. To meet the stipulated performance and power budgets, many key software components in these projects are being transitioned to adopt on-node parallelism to a greater degree. The predominant programming model of choice in this transition is OpenMP, due in large part to its portability and ease of use. We are working with computational scientists at Lawrence Livermore National Laboratory (LLNL), one of the world's largest computing facilities, where many of our mission-critical multiphysics applications [4] are being ported to exploit OpenMP.

We find, however, that efficient and scalable development tools for OpenMP are still quite scarce, making development efforts hard. In particular, none of the pre-existing OpenMP data race checkers is capable of handling the code sizes involved, or provides effective debugging support for concurrency bugs. Meanwhile, libraries such as Hypre [5], which underlie many critical applications, have run into data races during this transition. In one LLNL application, because of this lack of debugging tools, developers who faced these races even took the draconian approach of reverting back to sequential code.

In this paper we describe ARCHER, our new OpenMP data race detector, its unique capabilities in terms of scalability and precision, its use of a proposed standard, and our philosophy of building on well-engineered open-source software. While the core concept of a data race has been known for decades (uncoordinated, i.e., not separated by a happens-before edge, accesses on a memory location by two threads, with one access being a write), transitioning this idea into HPC practice required adherence to four key tenets.

(1) Scalable Happens-Before Tracking Methods: Checking for races in production OpenMP programs requires the ability to track a huge number of memory references and their happens-before ordering. A significant amount of ARCHER's scalability stems from its exploitation of a pre-existing tool, namely ThreadSanitizer (TSan) [6]. TSan's unique architecture enables it to implement the idea of vector-clock-based race checking far more efficiently than comparable tools do. Embracing TSan and its LLVM-based tooling approach enables us to write custom LLVM passes, and in general take advantage of the growing popularity of LLVM in HPC [7]. Previous OpenMP data race checking tools were never released for public evaluation, were based on binary instrumentation through PIN [8], or employed symbolic methods [9]. These approaches are neither scalable nor widely portable. ARCHER has been publicly released under the BSD License [10]. (Note: TSan was originally designed for PThread and Go programs, and cannot be directly applied to OpenMP programs, as will soon be described.)

(2) Static/Dynamic Analysis of Structured Parallelism: In ARCHER, we capitalize on OpenMP's structured parallelism to support two key features never before exploited in an OpenMP race checker. First, we exploit OpenMP's structured parallelism to easily write LLVM passes that identify guaranteed sequential regions within OpenMP. Such analysis would be difficult to conduct in the context of unstructured parallelism (e.g., PThreads). Second, we identify and suppress parallel loops from race checking. ARCHER achieves this by blacklisting accesses within parallelizable loops with the help of a static analysis.

(3) Modular Interfacing with OpenMP Runtimes: While structured parallelism has been exploited in the context of Java-like languages (e.g., Habanero Java [11]), such exploitation in the context of OpenMP and ARCHER required a combination of innovations. Unlike in languages such as Habanero Java, where the language and the runtime are designed together, in OpenMP vendors provide their own custom runtimes. Tools such as TSan must be suitably modified to ignore OpenMP internal actions, which may otherwise be falsely assumed to be data races [7]. ARCHER's approach is architected based on the OMPT standard [12] so that our solutions may be modularly incorporated with multiple OpenMP runtimes.

(4) Collaboration with Active Projects: ARCHER has already made significant impact within LLNL. As one example, HYDRA [13] is a large multiphysics application developed at LLNL, which is used for simulations at the National Ignition Facility (NIF) [14] and other high energy density physics facilities. It comprises many physics packages (e.g., radiation transfer, atomic physics, and hydrodynamics), and although all of them use MPI, a subset of them use thread-level parallelism (OpenMP and PThreads) in addition to MPI. It has over one million lines of code and a development lifetime that exceeds 20 years. In the summer of 2013, developers began porting HYDRA to Sequoia [15], the over 1.5 million core IBM Blue Gene/Q-based system that had just been brought online at that time. Although the efforts included incorporating more threading for performance, the developers were significantly impeded when they could not resolve a non-deterministic crash in an OpenMP-threaded version of Hypre [5] (used by one of HYDRA's scientific packages). The developers found it very difficult to debug this error, which occurred intermittently after varying numbers of time steps, only at large scales (at or above 8192 MPI processes), and only under compiler optimizations. After spending considerable amounts of time, the team suspected the presence of a data race within Hypre, but the difficulties in debugging and time pressure forced them to work around the issue by selectively disabling OpenMP in Hypre. When ARCHER was brought onto the scene, it located "benign races" involving two threads racing to write the same value to the same location, a practice known to be dangerous in the presence of compiler optimizations [16]. Removing these benign races fixed the bug. This episode, detailed in Section III-C, clearly shows that effective data race checkers specifically tailored to high-end computing environments are invaluable during critical projects.

II. APPROACH

Figure 1 illustrates how ARCHER implements our tenets by combining well-layered modular static and dynamic analysis stages. In more detail, our static analysis passes [17]–[19] help classify the given OpenMP code regions into two categories: guaranteed race-free and potentially racy. Our dynamic analysis then applies state-of-the-art data race detection algorithms [20], [21] to check only the potentially racy OpenMP regions of code. The static/dynamic analysis combination is central to the scalability (while maintaining analysis precision) of ARCHER, as evidenced by its ability to handle real-world examples that existing tools cannot handle with the same levels of precision and scalability (see Section III-B).

As described earlier, we implemented ARCHER using the LLVM/Clang tool infrastructure [22], [23] and the TSan dynamic race checker [6]. On the static analysis side, ARCHER uses Polly [19] to perform data dependency and loop-carried data dependency analysis (together called data dependency analysis from now on). This results in a Parallel Blacklist. ARCHER also extends some of the static analysis passes already present in LLVM. Specifically, our extension builds a call graph and traverses it to identify memory accesses that do not come from within an OpenMP construct (i.e., sequential code regions). This results in a Sequential Blacklist. These blacklists are combined and used to limit the instrumentation in TSan. On the dynamic analysis side, ARCHER uses our customized version of TSan to detect data races at runtime. To prevent TSan from being confused by OpenMP runtime internal actions (and falsely reporting them as OpenMP-level races), ARCHER employs TSan's annotation API to highlight these synchronization points within the LLVM OpenMP Runtime (the runtime presently associated with ARCHER in our studies). As we have already pointed out, our efforts are being migrated to adhere to the OMPT standard.¹

¹ Some of us are associated with the OMPT efforts, thus facilitating our collaboration further to benefit a wide variety of OpenMP runtimes.

A. Static Analysis Phase

We now detail some of the finer points of our static analysis, including feeding the blacklist information to the TSan runtime. TSan carries out its dynamic data race detection by first instrumenting all the load and store actions of a program at compile time, and using this instrumentation to help track happens-before. TSan also provides a feature that allows users to blacklist functions [24] (by their name) that should not be instrumented and that are thus to be ignored at runtime. Unfortunately, this granularity of instrumentation is insufficiently refined to handle our sequential and parallel blacklists, which express the intent to blacklist individual accesses (that are, in almost all cases, not demarcated by function boundaries). Thus, in order to communicate our blacklists to TSan, we extended its blacklisting capabilities to enable a finer-grained selection at the level of source

[Figure 1 diagram: (1) OpenMP C/C++ source code enters the Clang/LLVM compiler; (2) a pass recursively locates functions within OpenMP regions (omp_outlined blocks); (3) a data dependency analysis pass obtains dependence information through Polly and returns dependent loads and stores; (4) a sequential code detection pass returns all loads and stores not contained within parallel regions; (5) the TSan instrumentation pass instruments only loads/stores not in the resulting blacklist; (6) the executable, linked against the annotated OpenMP runtime and the TSan runtime, produces the data race report.]

Figure 1: ARCHER tool flow.
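Stock TSan consumes a "special case list" file naming functions or source files to leave uninstrumented, which is the function-granularity mechanism described above. A minimal sketch of such a file (the patterns are illustrative, not taken from ARCHER):

```
# Passed to Clang via -fsanitize-blacklist=<file>
# (newer releases spell it -fsanitize-ignorelist=<file>).
# Skip instrumentation of any function whose name matches:
fun:sort
fun:hypre_*
# Skip every access in matching source files:
src:third_party/*.c
```

ARCHER's extension refines this mechanism down to individual source lines, which the stock format cannot express.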

lines. This allows the modified TSan used by ARCHER to exploit our sequential and parallel blacklists, thus guaranteeing a high degree of analysis precision and scalability.

In more detail, after the LLVM intermediate representation (IR) and call graph are generated, our analysis transforms OpenMP pragmas in the LLVM IR code into outlined functions named omp_outlined.NUM, where NUM is an identifier for each parallel region present in the code. Our first pass visits the call graph, and for each omp_outlined function finds all the functions called within it. For each of these functions, the analysis is recursively applied. Thereafter, data dependency analysis and sequential code detection are applied (step (3) in Figure 1). For the former, an existing tool in the LLVM/Clang suite called Polly [19] is used. In the example given in Figure 2, the first for-loop (lines 7–9) is data parallel (i.e., data independent) and is blacklisted, while the second one (lines 12–14) is not (it exhibits a loop-carried dependence) and hence is not blacklisted.

ARCHER also identifies sequential code sections (step (4)). In Figure 2, lines 3 and 22 are sequential instructions and are hence blacklisted. However, function sort(), invoked at lines 4 and 18, cannot be blacklisted, as it is invoked from both a sequential and a parallel context. The payoff due to such sequential code detection is potentially very high in real-world projects where only some of the loops are parallelized with OpenMP (based on the benefits, the number of cores available, etc.). As already pointed out, these analyses are greatly facilitated by OpenMP's structured parallelism.

     1  main() {
     2    // Serial code              <- serial code: blacklisted
     3    setup();
     4    sort();                     <- used in serial and parallel code
     5
     6    #pragma omp parallel for
     7    for (int i = 0; i < N; ++i) {
     8      a[i] = a[i] + 1;          <- no data dependence: blacklisted
     9    }
    10
    11    #pragma omp parallel for
    12    for (int i = 0; i < N; ++i) {
    13      a[i] = a[i + 1];          <- potentially racy: instrumented
    14    }
    15
    16    #pragma omp parallel
    17    {
    18      sort();
    19    }
    20
    21    // Serial code              <- serial code: blacklisted
    22    printResults();
    23  }

Figure 2: Targeted instrumentation on a sample OpenMP program.

B. Dynamic Analysis Phase

Our use of TSan for OpenMP race checking hinges on the fact that OpenMP parallelism is typically realized through a PThread-based runtime library. As already mentioned, unmodified TSan cannot be meaningfully used for OpenMP due to the large number of false positives ("false alarms") it reports [7].

The OpenMP standard specifies several high-level synchronization points. Explicit synchronization points include barrier, critical, atomic, and taskwait. Implicit synchronization includes single, task, and the OpenMP reduction clause. As semantically intended and realized in the runtime, the threads can enter a critical section only in a serialized manner, thus avoiding a data race. However, TSan lacks any knowledge about these synchronization points. We use the annotation API of TSan to mark these synchronization points within the OpenMP runtime to avoid such false positives. This technique was successful in eliminating all false positives in our benchmarks. Finally, the combination of ARCHER's static analysis and the new TSan instrumentation that exploits the blacklist information produces a selectively instrumented binary (step (6) in Figure 1).

III. EVALUATION

We evaluate ARCHER in three stages through: (1) a collection of smaller benchmarks called the OmpSCR benchmark suite [25] (an OpenMP source code collection); (2) AMG2013, a non-trivial application from the HPC CORAL benchmark suite [26]; and (3) the HYDRA case study. Our evaluation is in terms of the effectiveness, performance, and scalability of ARCHER compared to Intel Inspector XE. We also compare ARCHER against an unmodified version of TSan applied to the same benchmarks.² When using TSan and ARCHER, we compile our benchmarks using Clang/LLVM, and when using Intel Inspector XE, we compile them using the Intel Compiler. When running our benchmarks under ARCHER, we link them against our annotated LLVM OpenMP Runtime [7], [27]. When running them under Intel Inspector XE as well as TSan, we employ the uninstrumented version of the same runtime.

We studied the following configuration selections:

ARCHER: We employ four configurations: (1) the basic configuration of ARCHER, which applies both static and dynamic analysis to reduce runtime and memory overhead; (2) ARCHER run without static analysis support (only dynamic race checking, using the enhanced runtime to avoid false positives); (3) applying just the Sequential Blacklist; and (4) applying just the Parallel Blacklist.

TSan: When running the unmodified version of TSan, we employ its default parameters.

Intel Inspector XE: Intel Inspector XE provides many "knobs" for controlling performance and analysis quality tradeoffs. Of these, we exercise three configurations: (1) a default mode that checks memory accesses at the coarse-grain granularity of four bytes; (2) the extreme-scope configuration that sets memory access granularity at a single byte (incidentally, the same granularity TSan employs), which obtains higher precision at higher cost; and (3) the use-maximum-resources configuration that allows Intel Inspector XE to detect more data races, but at the cost of increased memory consumption and greater runtime overhead.

We perform our evaluation on the Cab cluster at LLNL. Each Cab node has two 8-core, 2.6 GHz Intel Xeon E5-2670 processors and 32 GB of RAM. It runs the TOSS Linux distribution (kernel version 2.6), which is a customized distribution specifically targeting engineering and scientific applications. Runtimes and memory overhead of all benchmarks were averaged across 10 executions, each time running with a variable number of threads (ranging from 2 to 16). In the experimental results, Release denotes the original benchmark characteristics; SequentialBlacklisting and ParallelBlacklisting denote that just those blacklisting strategies are exploited; ARCHER denotes that both are used; and ARCHER "no SA" denotes that neither is used.

² Despite this exercise yielding numerous false positives, it provides a good performance baseline.

Benchmark                  InspectorDefault  InspectorExtremeScope  InspectorMaxResources  ARCHER "no SA"  ARCHER
c_fft                                  18.2                   22.1                   66.8             8.1     7.9
c_fft6                                 21.0                   25.3                  188.8            12.2    12.7
c_jacobi01                             38.2                   27.5                   25.2            19.2    15.6
c_jacobi02                             38.7                   26.7                   25.6            19.6    17.8
c_loopA.badSolution                     5.1                    7.0                   41.1             5.9     3.9
c_loopA.solution1                      10.2                   12.2                   64.9             9.4    10.5
c_loopA.solution2                       5.1                    7.1                   41.2             5.5     3.6
c_loopA.solution3                       4.5                    5.8                   42.1             5.1     4.2
c_loopB.badSolution1                    6.2                    7.5                   36.8             5.5     3.8
c_loopB.badSolution2                   15.6                   16.2                   43.7             2.3     2.3
c_loopB.pipelineSolution                5.4                    7.8                   36.7             5.6     3.6
c_lu                                   18.0                   19.6                  240.7            13.8    13.0
c_mandel                                5.6                    5.4                    5.3             1.7     1.7
c_md                                   12.7                   21.1                  253.4           197.3   197.1
c_pi                                   11.1                   10.7                   11.1             2.3     2.6
c_qsort                                14.2                   16.9                   34.1             5.8     5.7
c_testPath                            133.0                  133.6                  138.3            18.3    17.9
cpp_qsomp1                             57.5                   57.4                  289.5            18.0    18.1
cpp_qsomp2                             57.8                   57.6                  286.6            17.9    11.9
cpp_qsomp5                             56.8                   62.5                  338.2            20.4    20.8
cpp_qsomp6                             57.5                   57.9                  253.5            18.2    11.9
cpp_qsomp7                             57.8                   57.8                  229.3            18.8    18.3
Mean                                   29.5                   30.3                  122.4            19.6    18.4
Median                                 16.8                   20.3                   54.3            10.8    11.2
Geometric Mean                         18.3                   20.2                   71.5            10.0     8.8

Table I: Execution slowdown factor for various tool configurations.

A. OmpSCR Benchmark Suite

We chose the OmpSCR benchmark suite (see Table I) primarily because it harbors a few known races, as reported in prior work [8]. We, however, found several additional races not previously reported. With respect to each tool and configuration, we first describe the overall analysis quality and then the runtime overheads; we then summarize the overall merit of these tools by plotting their analysis quality against their performance scores.

Our evaluation shows that ARCHER detects all of the documented races in all configurations. In particular, it discovered six such races in the following benchmarks: c_loopA.badSolution, c_loopB.badSolution1, c_loopB.badSolution2, c_testPath, c_md, and c_jacobi3. In addition, ARCHER reported six previously undocumented races in the following C++ benchmarks: cpp_qsomp1, cpp_qsomp2, cpp_qsomp3, cpp_qsomp4, cpp_qsomp5, and cpp_qsomp6. (We manually verified that all the reported races are real.) In

[Figure 3 plots: two log-scale bar charts per OmpSCR benchmark, runtime in seconds (1–10^3) and memory in MB (1–10^4), for Release, InspectorDefault, InspectorExtremeScope, InspectorMaxResources, Archer (no SA), and Archer.]

Figure 3: Runtime and memory overhead of the tools on the OmpSCR benchmark suite executed with 16 threads.

contrast, Intel Inspector XE incurs varying degrees of accuracy and precision loss in all three configurations.

In terms of accuracy (the number of correctly detected races divided by the number of true races that should have been detected), Intel Inspector XE, in its default and extreme-scope configurations, misses races in benchmarks c_loopB.badSolution1, cpp_qsomp1, cpp_qsomp2, cpp_qsomp5, and cpp_qsomp6. On the other hand, Intel Inspector XE under the max-resources configuration detects most of these races, though it still misses the races in cpp_qsomp5 and c_loopB.badSolution1.³

³ We omit the evaluation of ARCHER in the Sequential and Parallel Blacklisting configurations for the OmpSCR benchmark suite, since the results for those configurations match the results of ARCHER without static analysis.

In terms of precision (the number of correctly detected races divided by the number of all the races detected, including false positives, i.e., "false alarms"), ARCHER in both configurations incurs no false positives, while Intel Inspector XE does. For example, in benchmark c_md, Intel Inspector XE reports an additional race that is clearly a false positive, as documented in related work [28]. In addition, in cpp_qsomp7, which uses the tasking construct as per OpenMP 3.1, Intel Inspector XE reports two false positives, which ARCHER in both configurations correctly avoids reporting as races. These results clearly demonstrate that ARCHER accurately understands the OpenMP task synchronization semantics.

We now discuss performance results in detail, in terms of runtime and memory overheads. We only present the results for 16 threads because the tools incur similar overheads as we vary the number of threads.⁴ Figure 3 details runtime and memory overheads for benchmarks in the OmpSCR benchmark suite. In a nutshell, ARCHER outperforms Intel Inspector XE across all of its configurations on most of the benchmarks. Intel Inspector XE incurs the least overhead in its default configuration, but this comes at the expense of degraded analysis quality. The extreme-scope configuration of Intel Inspector XE, which is closer to ARCHER's analysis granularity, incurs much higher overhead than ARCHER, with a few exceptions. The max-resources configuration results in very high resource consumption, and its overheads are always higher than those of ARCHER.

⁴ We omit three OmpSCR benchmarks from our performance results. The data race in c_jacobi3 highly influences the execution time of the benchmark, varying it by a factor of 1000 from run to run. The other two are cpp_qsomp3 and cpp_qsomp4, where data races cause them to crash.

ARCHER performs slightly better with static analysis support than without, catching all the data races in both cases. This can be mainly attributed to the fact that the OmpSCR benchmarks are small (in terms of lines of code), and hence static analysis finds very few blacklisting opportunities. Still, ARCHER with static analysis support is overall 15% faster on average. In Section III-B, we show that on a real-world HPC application static analysis reduces the runtime overhead much more, thus underscoring its importance in practice.

We assess the merits of the tools by plotting their analysis quality against performance. Table I shows the execution slowdown for each of the OmpSCR benchmarks under each of the tool configurations. We give the mean, median, and geometric mean in the last three rows. For space reasons, we omitted our other statistical measurements. However, using a confidence level of 0.05, we compared the slowdown distributions of each configuration (i.e., how our 10 measurements varied for each target benchmark) and verified that the distributions of ARCHER and ARCHER "no SA" do not overlap for a majority of cases. This indicates that the difference in performance between ARCHER and ARCHER "no SA" is statistically significant. In addition, Figure 4 gives the precision and accuracy of the tools, displayed with their true and false positive counts. The plot shows that ARCHER provides the best analysis quality with respect to other state-of-the-art race detectors, including Intel Inspector XE.

[Figure 4 plot: precision vs. accuracy for each tool. Reported races and false positives: Archer (12, 0), Archer "no SA" (12, 0), InspectorMaxResources (9, 2), InspectorExtremeScope (6, 3), InspectorDefault (6, 2).]

Figure 4: Precision and accuracy; in parentheses we report the number of reported races and false positives.

In Figure 5, we use an F-score (F1 score) [29] to capture the overall quality of analysis. The F-score is a measure of analysis quality that accounts for both accuracy and precision (as defined previously) and is given by:

    F1 = 2 · (precision · accuracy) / (precision + accuracy).

Thus, the F-score reaches its best value at 1 and its worst at 0. In Figure 5 we plot each tool in a two-dimensional space defined by the F-score and the geometric mean slowdown. We use the geometric mean as our performance metric because the mean and median are significantly skewed by outliers. Indeed, the slowdown values run the gamut from 253.4x down to 1.7x (mainly because of the very different characteristics and running times of the OmpSCR benchmarks), and this biases the arithmetic mean and median, while the geometric mean is designed to compute a figure of merit under such circumstances.

[Figure 5 plot: F-score vs. geometric mean slowdown for each tool configuration.]

Figure 5: Overall merit expressed as analysis quality vs. performance.

The plot shows the general attributes of each tool in terms
of accuracy and runtime overheads, and our design goal is to create a tool that lies as close as possible to the lower right corner. It is clear from the plot that ARCHER best meets this goal compared to other state-of-the-art tools: both versions of ARCHER (with and without static analysis) do much better than Intel Inspector XE in all its configurations.

B. AMG2013 Case Study

To complement our OmpSCR study with a larger code base, we perform an evaluation on AMG2013, which contains approximately 75,000 lines of code. AMG2013 [30] is a parallel algebraic multigrid solver for linear systems and is based on Hypre [5], [31], a large linear solver library developed at LLNL. Our experiments with ARCHER discovered three races within AMG2013 that were previously unreported. Thus, this application was useful to quantify both the performance and analysis quality of the tools. In the following, we compare the precision and performance of unmodified TSan, each ARCHER configuration, and Intel Inspector XE in three different configurations.

The unmodified TSan, after reporting about 150 false positives, crashes and never finishes its analysis. Intel Inspector XE reports all three data races when it is configured with use-maximum-resources. When using the extreme-scope configuration, it reports all three of the races, but only when running with 16 threads. Finally, when using the default configuration, Intel Inspector XE always misses one particular race of the three.

We now compare the performance of these tools in all of the different configurations. Figure 6 shows the AMG2013 execution slowdown factor introduced by the tools (a) and the relative performance factor of ARCHER (SA) against the other tools (b). Both ARCHER and Intel Inspector XE are dynamic checkers, and hence they introduce a large runtime overhead with respect to the application execution under no tool control (see Figure 6a). However, it is clear that ARCHER has significant performance advantages relative to the other tools. In fact, Figure 6b shows the relative performance of ARCHER (with and without static analysis) against all three configurations of Intel Inspector XE. ARCHER is generally 2–15x faster than Intel Inspector XE, depending on the number of threads. Compared to ARCHER without static analysis support, static analysis improves performance by a factor between 1.2 and 1.5, depending on the number of threads.

ARCHER also reduces the memory overhead relative to Intel Inspector XE in comparable configurations (modes other than default). However, its memory footprint still appears unnecessarily large. We surmise that this is because of TSan's runtime shadow memory allocation policy, which ARCHER inherits unmodified. In particular, when an array is initialized, all of its elements are accessed, and this causes TSan to allocate shadow memory corresponding to the entire array during initialization. Thereafter, TSan does not selectively deallocate this shadow memory, for instance, based on whether the array locations are live beyond a certain point. In our future work, we plan to confirm this, and then achieve selective deallocation, a possibility suggested by OpenMP's structured parallelism model.

The overall gains due to static analysis depend on the proportion of sequential regions (for Sequential Blacklisting) and data independent loops (for Parallel Blacklisting). Our future work will focus on characterizing these gains across many more large case studies.

C. ARCHER Resolves Real-World Races

We now present how ARCHER aided LLNL scientists in resolving the intermittent crashes in HYDRA mentioned in Section I. This investigation was spurred into action when our AMG2013 experiment discovered the three races mentioned earlier. Of the three data races flagged by ARCHER, two were found in a fairly complex OpenMP region spanning over 400 source lines with tens of reaching variables.⁵ The fact that these flagged sites were contained within a deeply nested control statement further complicated manual analysis; thus, we contacted the developer for further validation.

⁵ Specifically, one race is between the memory accesses at lines 1183 and 1248, and the other between lines 1184 and 1249, within par_interp.c.

In response, the developer confirmed that both were indeed true races. Specifically, one thread accesses the first element of a portion of an array defined by P_diag_i and P_offd_i (belonging to the next thread), while the second thread subtracts a number from this element. However, because the number being subtracted for this particular element was zero, this condition was never detected during testing. While programmers often consider this type of race (i.e., multiple threads writing the same value to the same memory location) benign, the developer did recognize that the containing function, hypre_BoomerAMGInterpTruncation, was one of the routines on which they had had to disable OpenMP parallelization for reliable use within HYDRA.

Encouraged by our findings, the application's team resumed their debugging of this issue. They applied a fix for these benign data races to the latest Hypre release (2.10.0b) and reran the simulation. This time, however, the simulation failed in a different way: a crash occurred very quickly and much more deterministically. Next, we applied ARCHER to this Hypre release using a representative test code provided by the developer, and ARCHER reported several additional benign data races; the races were detected between lines 2313 and 2315 of par_coarsen.c, where threads write the constant 0 to the same element in an array: CF_marker[j] = 0.

[Figure 6 plots: (a) AMG2013 execution slowdown and (b) relative performance of ARCHER (SA), each as a function of the number of threads (2–16), for InspectorDefault, InspectorExtremeScope, InspectorMaxResources, Archer (no SA), Archer, SequentialBlacklisting, and ParallelBlacklisting.]

Figure 6: AMG2013 execution slowdown factor introduced by the tools (a) and the relative performance factor of ARCHER (SA) against the other tools (b).

The developer was initially skeptical that these races were the root cause because the threads write the same numerical constant: 0 in the coarsening code and 1 in par_lr_interp.c. However, when we fixed all of these races, for example by synchronizing the respective assignments with OpenMP critical, the crashes no longer appeared. We theorize that the compiler (IBM XL) used on this platform, which would assume race-free code for optimization, transformed the code in such a way that those benign races turned into harmful ones, a pitfall described previously by Boehm [16].
While the developer is currently trying to find a way to resolve the races in a more performant manner, it was made clear that data race checkers like ARCHER, which are tailored to large HPC applications, are crucial to avoid programmer productivity loss on such elusive bugs.
IV. RELATED WORK
Data race detection in general is one of the most widely studied problems in concurrent program design and has been shown to be NP-hard [32]; a complete survey is beyond the scope of this section, so we focus on closely related approaches for correctness checking.
According to Erickson et al. [33], data races must be taken as "the smoking gun" for any number of root causes: insufficient atomicity (as per intended code behavior), an unreliable communication idiom, unintended sharing [34], or a misunderstanding of how generated code behaves vis-à-vis the higher-level program view (including possible miscompilation [16]). Static race detection methods provide high checking efficiency, but are known to generate false positives (e.g., [35], [36]); each false positive can cost a month of wasted reconfirmation time [37]. Polynomial-time race checking can often be achieved under structured concurrency [11]. Predictive methods attempt to find many more "implied" races based on an initial execution through the program (e.g., [38]).
ARCHER derives much of its efficiency by avoiding the instrumentation of independent loop iterations as well as sequential code regions. These approaches to achieve parsimonious instrumentation have recently been shown effective in the context of TSan and PThread programs through a technique called section-based program analysis [39]. The idea of specializing race checking has also taken root in the context of GPU programs, where symbolic methods coupled with a two-thread abstraction scheme have become popular [40]–[42]. This approach is also, in principle, applicable to OpenMP data race checking [9].
V. DISCUSSION
Despite OpenMP being around for over two decades, there are no practical data race detectors for OpenMP programs that an HPC practitioner can use in the field today; ARCHER is the first such race checker, and its approach is both timely and necessary to provide the widening field of OpenMP programming with this critical correctness tool capability. In fact, the main developer of TSan has taken an active interest in our work, and even the LLVM community has helped us by supporting TSan on the PowerPC platform [43].
While ARCHER has proven to be useful at debugging real-world races in OpenMP applications, we now discuss the practical implications of our approach with respect to (1) features in the latest OpenMP specifications and (2) the use of compiler optimization flags.
A. Latest OpenMP Specifications
The OpenMP Architecture Review Board released the latest OpenMP specification (Version 4.0) in July 2013. We expect that it will take major compilers a few years to come to full compliance with this specification. At the time of writing, no compiler fully supports OpenMP 4, including its device construct. The OpenMP branch of Clang/LLVM, on which we demonstrate our approach, supports only OpenMP 3.1. While this practical limitation only allowed us to explore the problem space of OpenMP 3.1, we recognize that OpenMP 4, once implemented by compilers and thus adopted by our applications, will present a new set of challenges to our approach.
In particular, with the device construct, OpenMP threads will run not only on CPUs but also on accelerators, such as GPUs, which are also subject to the harmful effects of data races. Unfortunately, tooling in this area is not as comprehensive as designers may like. For example, the CUDA Memcheck tool [44] is limited in that it can only detect data races that occur in the thread-block-level shared memory space; yet, in practice, races also occur in the global memory scope [42]. Given the current trend toward coherent memory between CPUs and GPUs [2], it is clear that the community will need more comprehensive race detection techniques. In addition, because of the higher numbers of OpenMP threads that can run on GPUs, techniques to further enhance scalability (e.g., by exploiting thread symmetry relationships) must be researched and developed.
B. Compiler Optimization Flags
Recent work [16] suggests that it is critical to pinpoint and fix data races that many programmers consider benign. In particular, the presence of any data race can lead a compiler to turn a benign race into a harmful one, even when code transformations that are considered safe are used. In this regard, ARCHER best detects data races at the source level with no compiler optimization (-O0), because an optimization can hide the presence of a race through transformations. Further, data races could be introduced through an illegal transformation; this is a problem within the compiler itself, and ARCHER does not pursue this class of errors.
VI. CONCLUSIONS AND FUTURE WORK
In this paper, we have presented ARCHER, an OpenMP data race checker that embodies the design principles needed to cope with and exploit the characteristics of large HPC applications and their perennial development lifecycle. ARCHER seamlessly combines the best of static and dynamic techniques to deliver on these principles. Our evaluation results strongly suggest that ARCHER meets its design objectives by incurring low runtime overheads while offering very high accuracy and precision. Further, our interaction with scientists shows that it has already proven effective on highly elusive, real-world errors, which can significantly waste scientists' productivity.
However, our challenge does not end here. As part of bringing ARCHER to full production, we must innovate further. In particular, we need to reduce its runtime and memory overheads further so as to benefit a wide range of production uses. For this purpose, we will keep tapping into the great potential of the static analysis space. For example, ARCHER currently classifies each OpenMP region with a binary classification: race free or potentially racy. More advanced techniques will allow us to move away from this binary logic. In fact, we plan to crack open each of these potentially racy regions and apply fine-grained static techniques in order to identify and exclude race-free sub-regions within them. Exploiting the symmetries in OpenMP's structured parallelism is another avenue we plan to explore. Adequately defined symmetries will allow ARCHER to target a smaller set of representative threads and memory for further overhead reduction.
Clearly, the aforesaid challenges cannot be pursued single-handedly. To enable communal participation (as noted earlier), we have released ARCHER in the public domain [10], and we look forward to input from the community.
ACKNOWLEDGMENTS
We would like to thank the anonymous reviewers for their constructive comments. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344 (LLNL-PROC-679754). Support is gratefully acknowledged from the National Science Foundation under grants ACI-1535032 and CCF-7298529.
REFERENCES
[1] "CORAL/Sierra," https://asc.llnl.gov/coral-info.
[2] "SUMMIT: Scale new heights. Discover new solutions." https://www.olcf.ornl.gov/wp-content/uploads/2014/11/Summit_FactSheet.pdf.
[3] "Trinity," http://www.lanl.gov/projects/trinity/.
[4] "Compute codes," https://wci.llnl.gov/simulation/computer-codes.
[5] Center for Applied Scientific Computing (CASC) at LLNL, "Hypre," http://acts.nersc.gov/hypre/.
[6] "ThreadSanitizer, a data race detector for C/C++ and Go," https://code.google.com/p/thread-sanitizer/.
[7] J. Protze, S. Atzeni, D. H. Ahn, M. Schulz, G. Gopalakrishnan, M. S. Müller, I. Laguna, Z. Rakamarić, and G. L. Lee, "Towards providing low-overhead data race detection for large OpenMP applications," in LLVM Compiler Infrastructure in HPC, 2014, pp. 40–47.
[8] O.-K. Ha, I.-B. Kuh, G. M. Tchamgoue, and Y.-K. Jun, "On-the-fly detection of data races in OpenMP programs," in Workshop on Parallel and Distributed Systems: Testing, Analysis, and Debugging, 2012, pp. 1–10.
[9] H. Ma, S. Diersen, L. Wang, C. Liao, D. J. Quinlan, and Z. Yang, "Symbolic analysis of concurrency errors in OpenMP programs," in ICPP, 2013, pp. 510–516.
[10] "ARCHER," https://github.com/PRUNER/archer.
[11] R. Raman, J. Zhao, V. Sarkar, M. T. Vechev, and E. Yahav, "Efficient data race detection for async-finish parallelism," FMSD, vol. 41, no. 3, pp. 321–347, 2012.
[12] A. E. Eichenberger, J. Mellor-Crummey, M. Schulz, M. Wong, N. Copty, R. Dietrich, X. Liu, E. Loh, and D. Lorenz, "OMPT: An OpenMP tools application programming interface for performance analysis," in OpenMP in the Era of Low Power Devices and Accelerators, 2013, pp. 171–185.
[13] S. H. Langer, I. Karlin, and M. Marinack, "Performance characteristics of HYDRA — a multi-physics simulation code from LLNL," Lawrence Livermore National Laboratory, Tech. Rep., 2014.
[14] Lawrence Livermore National Laboratory, "National Ignition Facility and Photon Science," https://lasers.llnl.gov.
[15] Lawrence Livermore National Laboratory, "Advanced Simulation and Computing Sequoia," https://asc.llnl.gov/computing_resources/sequoia.
[16] H.-J. Boehm, "How to miscompile programs with 'benign' data races," in USENIX Conference on Hot Topics in Parallelism, 2011, pp. 3–3.
[17] K. Kennedy and J. R. Allen, Optimizing Compilers for Modern Architectures: A Dependence-based Approach. Morgan Kaufmann Publishers Inc., 2002.
[18] S. S. Muchnick, Advanced Compiler Design and Implementation. Morgan Kaufmann Publishers Inc., 1997.
[19] T. Grosser, A. Groesslinger, and C. Lengauer, "Polly — performing polyhedral optimizations on a low-level intermediate representation," Par. Proc. Letters, vol. 22, no. 04, 2012.
[20] K. Serebryany and T. Iskhodzhanov, "ThreadSanitizer: Data race detection in practice," in Workshop on Binary Instrumentation and Applications, 2009, pp. 62–71.
[21] C. Flanagan and S. N. Freund, "FastTrack: Efficient and precise dynamic race detection," in PLDI, 2009, pp. 121–133.
[22] C. Lattner, "LLVM and Clang: advancing compiler technology," in FOSDEM, 2011.
[23] C. Lattner and V. Adve, "LLVM: A compilation framework for lifelong program analysis & transformation," in CGO, 2004, pp. 75–86.
[24] K. Serebryany and D. Vyukov, "Sanitizer special case list," http://clang.llvm.org/docs/SanitizerSpecialCaseList.html.
[25] A. J. Dorta, C. Rodríguez, F. de Sande, and A. González-Escribano, "The OpenMP source code repository," in PDP, 2005, pp. 244–250.
[26] "CORAL Benchmark Codes," https://asc.llnl.gov/CORAL-benchmarks/.
[27] "Intel OpenMP Runtime Library," https://www.openmprtl.org.
[28] B. Chapman, G. Jost, and R. v. d. Pas, Using OpenMP: Portable Shared Memory Parallel Programming (Scientific and Engineering Computation). The MIT Press, 2007.
[29] J. D. Kelleher, B. Mac Namee, and A. D'Arcy, Fundamentals of Machine Learning for Predictive Data Analytics: Algorithms, Worked Examples, and Case Studies. MIT Press, 2015.
[30] Center for Applied Scientific Computing (CASC) at LLNL, "AMG2013," https://codesign.llnl.gov/amg2013.php.
[31] V. E. Henson and U. M. Yang, "BoomerAMG: a parallel algebraic multigrid solver and preconditioner," Applied Numerical Mathematics, pp. 155–177, 2002.
[32] R. H. B. Netzer and B. P. Miller, "What are race conditions?: Some issues and formalizations," LOPLAS, pp. 74–88, 1992.
[33] J. Erickson, M. Musuvathi, S. Burckhardt, and K. Olynyk, "Effective data-race detection for the kernel," in OSDI, 2010.
[34] J. Erickson, S. Freund, and M. Musuvathi, "Dynamic analyses for data-race detection," http://rv2012.ku.edu.tr/invited-tutorials/, 2012.
[35] P. Pratikakis, J. S. Foster, and M. Hicks, "LOCKSMITH: Practical static race detection for C," TOPLAS, vol. 33, no. 1, pp. 3:1–3:55, 2011.
[36] J. W. Voung, R. Jhala, and S. Lerner, "RELAY: Static race detection on millions of lines of code," in ESEC-FSE, 2007, pp. 205–214.
[37] P. Godefroid and N. Nagappan, "Concurrency at Microsoft: An exploratory survey," in Workshop on Exploiting Concurrency Efficiently and Correctly (EC2), 2008.
[38] J. Huang, P. O. Meredith, and G. Rosu, "Maximal sound predictive race detection with control flow abstraction," in PLDI, 2014, pp. 337–348.
[39] M. Das, G. Southern, and J. Renau, "Section based program analysis to reduce overhead of detecting unsynchronized thread communication," in PPoPP, 2015, pp. 283–284.
[40] G. Li and G. Gopalakrishnan, "Scalable SMT-based verification of GPU kernel functions," in FSE, 2010, pp. 187–196.
[41] A. Betts, N. Chong, A. F. Donaldson, S. Qadeer, and P. Thomson, "GPUVerify: a verifier for GPU kernels," in OOPSLA/SPLASH, 2012, pp. 113–132.
[42] P. Li, G. Li, and G. Gopalakrishnan, "Practical symbolic race checking of GPU programs," in Supercomputing, 2014, pp. 179–190.
[43] http://reviews.llvm.org/D12841, 2015.
[44] "CUDA-MEMCHECK tool," http://docs.nvidia.com/cuda/pdf/CUDA_Memcheck.pdf.