Using Hardware Transactional Memory for Data Race Detection

Shantanu Gupta
Department of EE and CS, University of Michigan
[email protected]

Florin Sultan, Srihari Cadambi, Franjo Ivančić and Martin Rötteler
NEC Laboratories America, Princeton, NJ
{cadambi, ivancic, mroetteler}@nec-labs.com

(F. Sultan was with NEC Laboratories America when this research was carried out. He is now with Telcordia Research, NJ, USA.)

Abstract—Widespread emergence of multicore processors will spur development of parallel applications, exposing programmers to degrees of hardware concurrency hitherto unavailable. Dependable multithreaded software will have to rely on the ability to dynamically detect non-deterministic and notoriously hard-to-reproduce synchronization bugs manifested through data races. Previous solutions to dynamic data race detection have required specialized hardware, at additional power, design and area costs. We propose RaceTM, a novel approach to data race detection that exploits hardware that will likely be present in future multiprocessors, albeit for a different purpose. In particular, we show how emerging hardware support for transactional memory can be leveraged to aid data race detection. We propose the concept of lightweight debug transactions that exploit the conflict detection mechanisms of transactional memory systems to perform data race detection. We present a proof-of-concept simulation prototype, and evaluate it on data races injected into applications from the SPLASH-2 suite. Our experiments show that this technique is effective at discovering data races and has low performance overhead.

I. INTRODUCTION

Due to a variety of factors including production costs, reliability and power consumption, the computing industry is shifting to multi-core processing. Increased system performance will have to be achieved mostly through higher numbers of cores per chip rather than through gains from clock speed increases. It is generally understood that the only way to continue performance improvements is by leveraging parallel programming. Some studies even indicate that while cores in future chips will be more numerous than at present, they may be slower and simpler due to power density concerns. For instance, the cores may not be as deeply pipelined, or may not have sophisticated out-of-order execution mechanisms.

Parallel programming is thus necessary to continue future performance scaling. Writing correct parallel programs, however, is significantly harder than writing correct sequential programs. An insidious problem with parallel programs is data races. Data races occur when two threads of a program access a common memory location without synchronization and at least one of the accesses is a write. Some data races cause no change in the program output, but others produce incorrect results. Data races often indicate programming errors, and are hard to detect primarily due to their non-deterministic nature. For this reason, automatic data race detection is acknowledged to be a difficult problem that has become extremely important in the light of ubiquitous parallel programming.

Data race detection methods can be classified as either static or dynamic. Static race detection tools analyze different thread interleavings to detect possible data races. However, these techniques are either not scalable or too conservative, often sacrificing precision for scalability. For example, it may not be feasible to check all possible thread interleavings in a reasonable amount of time. Hence, a static analyzer may make a conservative guess, which results in data races being reported when none exist. Similarly, the absence of a lock protecting a shared variable does not necessarily imply a data race. Such false positives are an annoyance and may hamper productivity by causing the programmer to re-examine and attempt to debug portions of the code where no data races exist. On the other hand, in dynamic race detection mechanisms based on locksets [26] or happened-before [13], information about the history of an actual program execution is stored and analyzed at runtime. Such methods are faster but are restricted to discovering only data races exhibited by, or close to, a particular execution.
Data race detection is slow when performed entirely in software. Hardware-assisted dynamic race detection mechanisms such as HARD [32] or CORD [21] are faster but require specialized hardware, an approach that is not cost-effective.

This paper describes RaceTM, a system that leverages transactional memory hardware, likely to exist in future multiprocessors, in order to perform fast and efficient dynamic data race detection. To avoid the costs associated with building specialized hardware, we propose to reuse existing hardware structures, with minimal or no changes, in order to accelerate data race detection. We demonstrate how the same hardware that supports code with transactional critical sections can be used for detecting data races in an entire program.

Transactional Memory (TM) [10] is a parallel programming model that many argue will become a standard for programming future parallel systems, especially multicore processors. Under TM, a program is written in terms of atomic transactions that are speculatively executed. Two transactions are in conflict with one another when at least one of them writes to a memory location that has been accessed by the other. When a conflict is detected between two transactions, one of them is rolled back (i.e., changes it made to memory are not committed) and attempted again. The optimistic speculative execution allows for increased concurrency, since in most cases two potentially conflicting threads will not actually conflict. The goal of TM systems is to achieve the performance of fine-grained locks while providing the relative ease of programming achieved by using coarse-grained locks.

Additional hardware is not necessary to support transactional memory, but software TM systems are estimated to be slower when compared to using fine-grained locks. Hence hardware-assisted TM systems [2], [7], [23], where specialized hardware performs conflict detection and rollback, have been proposed. As an indication of the benefits of TM systems, Sun Microsystems has included support for transactional memory in its Rock processor [28].

Although TM is gaining popularity as a programming model, not all the code of a program will be covered by transactions, for several reasons. First, I/O and certain system calls are irreversible and cannot be undone, and therefore cannot be executed speculatively and rolled back later. Second, code fragments with large memory footprints require large amounts of storage for checkpointing. Converting such code into transactions not only affects performance, but may overrun the limited resources of the underlying TM system [14]. Third, legacy code using locks will likely transition to TM by replacing lock-protected critical sections with transactions [2], [24], [25]. This will leave large sections of the code exposed to data races.

Fig. 1. Debug transactions (DTM) detect a race on Y outside regular transactions (TM).

In [6] we introduced the concept of lightweight debug transactions that span non-transactional code and exploit the conflict detection mechanisms of a TM system to detect data races. A programmer or compiler can use simple primitives to manipulate (start, stop, pause, resume) debug transactions during execution. Figure 1 provides a high-level view of how our system works. Data races occurring within transactions (labeled TM) are serialized by the TM system, while those occurring outside transaction boundaries, such as the one on variable Y, remain undetected. RaceTM automatically covers unprotected regions of code with debug transactions (labeled DTM), and uses the existing TM conflict detection mechanism to detect such races. In this paper, we present an implementation of RaceTM based purely on a hardware-based TM system, with minimal changes to the provided hardware support and coherence protocol. This paper also provides details on the debugging support provided and presents experimental results for a subset of the SPLASH-2 [29] benchmarks.
Debug transactions differ from regular transactions in that they do not need to be rolled back, and hence need no checkpointing support. Further, unlike regular transactions, debug transactions do not change program semantics as they do not enforce atomicity. This enables their use in sections of code not covered by transactions or where it is impossible or difficult to use TM. The concept of debug transactions does not depend on the implementation of the underlying TM system, and is independent of the type of TM conflict detection and versioning scheme (lazy or eager, see Section II). We present a prototype of RaceTM based on LogTM [17], a hardware TM system. We demonstrate that lightweight hardware support for data race detection is possible by leveraging the same hardware that is required to support TM.

In addition to lightweight hardware support for data race detection, our prototype provides race reporting where the analysis speed can be traded off against the information provided to the user (debugging stage vs. deployed code). We provide the option of fast analysis with limited information (from only one thread involved in the data race), or slower analysis with complete information. Advantages of our system are that it is non-intrusive (so that race detection can actually be performed during production runs) and scalable with respect to the number of processors. RaceTM is free of additional hardware cost under the assumption of a hardware-assisted TM system. Our evaluation, performed with synchronization bugs injected in SPLASH-2 applications, exposed no false negatives and generated few false positives.

The remainder of the paper is organized as follows. In Section II, we present an overview of data races and transactional memory. In Section III, we discuss related work. In Section IV, we present in detail our data race detection system. We present experimental results with our prototype in Section V and conclude in Section VI.

II. BACKGROUND

A. Data Races

Concurrent programs exhibit complex behavior, with subtle interactions between threads leading to deadlocks, data races and atomicity violations. In this paper we focus on the problem of data race detection under a transactional memory model that already guarantees freedom from deadlocks. A data race often results in unexpected memory states of shared data structures, which leads to program failures in unrelated parts of the source code. It should also be mentioned, however, that many data races are benign, i.e., although the data race occurs, it does not cause any unexpected memory corruption or it corrupts non-essential data. An example of a benign race is a corrupted counter that is used for debugging purposes. Furthermore, synchronization primitives implemented by the user directly in the source code can cause incorrect warnings about races.

Static race detection algorithms attempt to be sound, i.e., they report any access that could be involved in a data race as a potential bug, which often produces too many warnings. This is amplified by language features such as pointer indirection, which forces sound data race detection tools to employ aliasing and points-to analyses [11]. In contrast to static techniques, dynamic tools detect potential data races that occur at run-time, regardless of memory safety issues and without an explicit up-front pointer aliasing analysis, by observing the actual memory state in the single execution trace that is being considered.
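To make the definitions above concrete, the following self-contained C/pthreads fragment (not taken from the paper; the variable names and iteration count are illustrative) exhibits both a harmful race and a benign one of the debug-counter kind mentioned above.

#include <pthread.h>
#include <stdio.h>

static long balance = 0;      /* shared, unprotected: harmful race   */
static long debug_hits = 0;   /* shared, unprotected: benign race    */

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        balance += 1;     /* read-modify-write without a lock: a data race      */
        debug_hits += 1;  /* also a race, but only a debug counter is corrupted */
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    /* Both counters may end up below 200000; the lost updates on balance */
    /* affect program output, while those on debug_hits are benign.       */
    printf("balance=%ld debug_hits=%ld\n", balance, debug_hits);
    return 0;
}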
B. Transactional Memory

Transactional memory (TM) has been proposed as an alternative programming model to lock-based synchronization. Instead of waiting for a lock, a thread of a multithreaded TM program can make changes to shared memory without being concerned with other threads' actions. The underlying runtime TM system provides a mechanism for concurrency control that discovers and resolves conflicting accesses to shared memory. A TM system presents the abstraction of a set of shared memory locations which a programmer-defined sequence of reads and writes in a thread (a transaction) can access atomically and in isolation with respect to accesses from other threads. With TM, a programmer would use two basic system primitives to delimit transactions (tx_begin and tx_end). The system ensures that writes of a thread that originate inside a transaction occur logically at once, at the commit point of the transaction. To provide this abstraction while ensuring a high degree of concurrency between transactions, the TM system performs two tasks: conflict detection and version management.

Conflict detection discovers conflicting accesses originating in concurrent transactions. Most TM systems track the write-set (locations written) and read-set (locations read) for each transaction, and signal a conflict if a transaction's write-set overlaps with the read/write-set of other concurrent transactions. Conflict detection can be either eager (at the moment of the access that triggers the conflict) or lazy (deferred to the commit point). On a conflict, the system forces all but one of the transactions involved in it to abort; the winning transaction may proceed and eventually commit its changes atomically to memory.

Aborting a transaction requires rollback of execution and memory state to the point of the tx_begin call. Version management is responsible for storing both the new data (to merge with the memory state on commit) and the old data (to restore the initial memory state in case of abort). Version management can be eager or lazy. With eager management, memory updates are performed immediately and old values are stored in an undo log. With lazy management, memory updates are buffered until the commit point.
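As a concrete illustration of this programming model, the sketch below replaces a lock-protected critical section with a transaction delimited by the generic tx_begin/tx_end primitives named above; the C bindings of the primitives and the account structure are assumptions made for the example, and the primitives are presumed to be supplied by the TM runtime.

/* Assumed C bindings for the TM primitives named in the text. */
extern void tx_begin(void);
extern void tx_end(void);

struct account { long balance; };

/* Lock-based version, for comparison:
 *   pthread_mutex_lock(&m); from->balance -= amt; to->balance += amt; pthread_mutex_unlock(&m);
 */

/* TM version: all writes between tx_begin and tx_end commit atomically at  */
/* tx_end; on a conflict with a concurrent transaction the TM system aborts */
/* and re-executes one of them, so no explicit lock ordering is needed.     */
void transfer(struct account *from, struct account *to, long amt)
{
    tx_begin();
    from->balance -= amt;
    to->balance   += amt;
    tx_end();
}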
There are a number of proposed TM implementations: (i) software TM (STM), which implements TM primitives entirely in software [8], [9], [27]; (ii) hardware TM (HTM), with full hardware support [2], [7], [17], [23]; and (iii) hybrid TM (HyTM), where the hardware provides some support to an STM [4], [16]. HTM is more efficient but may suffer from bounded hardware resources (e.g., write buffers, cache) used to store transactional state [2]. HyTM systems either provide minimal hardware support to speed up or augment an STM implementation [16], or build an STM that augments a best-effort HTM with an alternate software path when resources are exhausted [4].

Dynamic Data Race Detection. TM mechanisms resolve conflicting accesses between concurrent transactions. However, non-transactional code is still prone to data races, and despite recent research into how to efficiently enlarge the scope of transactions [16], it is likely that the vast majority of code (either legacy lock-based code ported to TM or new TM-based code) will use small transactions for reasons related to both performance and program semantics [2]. Since it is largely envisioned that lock-based legacy code will most likely be "transactified" (at least in a first approximation) by replacing locks with transactional code sections (see for example [2], [24], [25]), it is likely that dormant data race bugs will silently propagate into the transactionalized code. Worse, such bugs may be introduced in newly written transactional code, as programmers become accustomed to the idea that TM is easy enough to use, while eliding the danger of simple but subtle bugs such as an access to a shared variable slipping out of a transaction's scope. For example, a programmer of low-level code (e.g., a library) may assume that his code is called from an outer scope that is transactional, while the caller ignores this assumption.

We therefore believe that in TM systems data races will continue to be important. We advocate an efficient and cost-effective solution to cope with data races in transactional programs. We make the observation that basic data race detection mechanisms are already present in TM systems, where they are used for detecting conflicting transactions. Moreover, in a transactional program these mechanisms are unused for large sections of the code, and those are exactly the places where potential data races are to be found. We propose RaceTM, a novel approach to reusing the conflict detection machinery of an HTM system to provide efficient, accurate and non-intrusive dynamic race detection.
III. RELATED WORK

Dynamic Data Race Detection. Two algorithms have been commonly used for data race detection: lockset and happened-before. Lockset [26] detects races indirectly by reporting violations of a locking discipline which states that accesses to a given shared variable must be protected by a common lock. It tracks a per-thread lockset (the set of locks held) and, for all shared variables, a per-variable candidate set (the set of locks that any thread held at the time it accessed the shared variable). If the candidate set becomes empty, i.e., the variable is not protected by at least one common lock across all accessing threads, then a potential race is reported.

Happened-before algorithms [19], [21], [22] use the happened-before [13] relation induced by synchronization events between threads of a program to establish a partial order on events in the program (in particular, on memory accesses). If two conflicting accesses cannot be temporally ordered according to the happened-before relation, i.e., they are concurrent, this means they took place with no intervening synchronization operation and are thus involved in a race. Hybrid systems [18], [20], [31] have combined lockset and happened-before. They are all software-based and exploit higher-level abstractions and interfaces at the language level.
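The lockset refinement just described can be summarized by the following sketch; the bit-set representation and helper name are illustrative assumptions, not the actual Eraser [26] implementation.

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t lockset_t;   /* bit i set => lock i is in the set (up to 64 locks) */

/* C(v): per-variable candidate set, initially "all locks".                    */
/* L(t): set of locks currently held by thread t.                              */
/* On every access by thread t to shared variable v:                           */
/*   C(v) <- C(v) intersect L(t); report a race if C(v) becomes empty.         */
bool lockset_on_access(lockset_t *candidate_v, lockset_t held_by_t)
{
    *candidate_v &= held_by_t;        /* set intersection                          */
    return *candidate_v == 0;         /* true => no common lock protects v: report */
}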

Specialized Hardware Support. HARD [32] implements lockset-based detection using Bloom filters to represent locksets and candidate sets. It adds 16-bit fields to each L1/L2 cache line for the candidate set, computed as a hash (Bloom filter vector) over the addresses of all locks protecting each shared variable mapped to that cache line. The thread lockset is also stored as a Bloom filter in a special Lock Register in the CPU. In HARD, false positives can occur due to false sharing of a candidate set by variables mapped to the same cache line. A large number of locks may cause collisions in the Bloom filters, yielding non-empty candidate sets for unprotected accesses, and thus false negatives (missed races). HARD requires specialized support from the CPU and the cache hierarchy (special registers and cache fields) and mandates changes to the cache coherency protocol to propagate candidate sets and to broadcast accesses that cause a variable's candidate set to change.

CORD [21] uses a scalar timestamp per cache line along with per-word R/W access bits to approximate a happened-before ordering. The timestamp of a cache line is compared against a thread's logical clock at the time of an access to determine whether the latest and the current access are properly ordered. For practical reasons, it deviates from Lamport's happened-before in two ways: (i) it forcefully synchronizes a thread's clock even when it detects a data race (potentially obscuring some future races); (ii) it may cause threads to falsely synchronize via (unrelated) accesses to main memory, since it approximates the lost timestamp of an evicted cache line by the largest timestamp seen among all evicted lines. These inaccuracies lead to missed races, in addition to those due to the inherent inability of scalar clocks and timestamps (as opposed to vector clocks [5], [15]) to fully discriminate concurrent events originating in different threads. CORD requires dedicated hardware in the form of logic, additions to caches and the CPU, timestamp buses, and a complex protocol.

A more generalized hardware mechanism has been proposed in [3] to accelerate a range of instruction-grain monitoring tools such as memory checkers and security trackers, in addition to data race detectors. Similar to CORD, this also requires dedicated hardware.

Hardware Reuse. Solutions in this class have looked at how to reuse hardware support originally intended for other functionality to also perform data race detection. Such approaches are most cost-effective as hardware resources are not dedicated exclusively to race detection, thereby amortizing their cost. In [12], data-speculation support present in the Intel IA-64 architecture is used to detect conflicting memory accesses. The compiler inserts a speculative (advanced) load in a thread, along with a conflict check instruction at the point where the loaded value is about to be used. The scheme is limited to a pair of threads that must run on the same core. ReEnact [22] proposes several additions to hardware thread-level speculation (TLS) mechanisms to support both race detection and debugging by deterministic replay. ReEnact orders speculation epochs according to communication events between processors and local serial execution. Further, by ending epochs at synchronization events, ReEnact induces a happened-before order that it uses to detect data races.

IV. TRANSACTIONAL MEMORY-BASED DATA RACE DETECTION

A. RaceTM: General Technique

Transactions partition a given piece of code into two categories, one that is encompassed by transactions and the other that is not. We denote the regions of code covered by user-specified transactions as TM, and those not covered as NoTM. The TM regions are protected by transactional memory semantics: conflict detection (and subsequent rollback) ensures that accesses to shared memory within transactions do not lead to data races. However, any code outside transactions remains vulnerable to data races. Improper deployment of transactions, in the same way as with locks, can cause programming errors leading to data races. Figure 2(a) illustrates this condition, where the variable Y is undergoing a data race.

Fig. 2. (a) Code accessing X is protected by a transaction, but the code accessing Y is vulnerable to a data race; both Thread 1 and Thread 2 execute: result = Y; tx_begin; myvar = X; X = newval; tx_end; Y = 12. (b) The same code with the accesses to Y covered by a debug transaction (dtx_begin; result = Y; dtx_pause; tx_begin; myvar = X; X = newval; tx_end; dtx_resume; Y = 12; dtx_end); the data race on Y will be detected by the debug transaction.

The key insight of RaceTM [6] is to use the transactional memory conflict detection capability to detect data races, such as the one for variable Y in Figure 2(a), that are caused by shared memory accesses in the NoTM portion of the code. It is evident that such races can occur due to programmer mistakes, but they are semantically no different from the ones encountered with missing locks in the code. As a solution, we propose the deployment of debug transactions (DTX) that can span regions of code outside regular transactions. Debug transactions behave the same as transactions except that they do not support rollback. They are lightweight transactions capable of performing conflict detection on memory accesses. Rollback support accounts for most of the TM-related overhead since it requires state checkpointing and version management. With DTXs in place, the entire code can be covered by the transactional memory conflict detection mechanisms, and any potential bug in accessing shared memory can be reported. RaceTM provides control primitives for starting/stopping and for pausing/resuming a DTX (explained in detail in Section IV-B2). Figure 2(b) shows a code sample that employs DTXs (specified by calls to primitives prefixed by dtx) in the regions of code left out by regular transactions, which makes it possible to detect the data race on Y.
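A C rendering of the Figure 2(b) pattern is sketched below; the dtx_*/tx_* functions are the primitives named in the text, with C bindings assumed for illustration (they are presumed to be provided by the RaceTM/TM runtime), and the value written to X is an arbitrary placeholder for the figure's "newval".

/* Assumed C bindings for the RaceTM and TM primitives used in Figure 2. */
extern void dtx_begin(void);  extern void dtx_end(void);
extern void dtx_pause(void);  extern void dtx_resume(void);
extern void tx_begin(void);   extern void tx_end(void);

extern int X, Y;   /* shared variables from Figure 2 */

void thread_body(void)
{
    int result, myvar;

    dtx_begin();          /* debug transaction covers the NoTM code           */
    result = Y;           /* this read of Y is recorded in the DTX read-set   */

    dtx_pause();          /* regular transaction: TM already protects it      */
    tx_begin();
    myvar = X;
    X = myvar + 1;        /* stands in for "X = newval" in the figure         */
    tx_end();
    dtx_resume();

    Y = 12;               /* unsynchronized write: conflicts with the other   */
                          /* thread's DTX read of Y, so a race is reported    */
    dtx_end();
    (void)result; (void)myvar;
}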
The presence in the code of synchronization primitives such as locks and barriers, of which a TM system is not aware, may interfere with debug transactions. For example, locks are expected to still be used in transactional memory programs to serialize accesses to I/O devices (since I/O operations cannot be replayed). Similarly, legacy code in the OS and runtime libraries may also use locks. A DTX can still cover code that uses these primitives, but at the expense of flagging false data races and requiring additional postprocessing to filter them out. To avoid false race reports from such sources, RaceTM provides support for DTXs to be temporarily paused across lock-protected regions of code, across barriers, as well as, conservatively, across system and library calls that use lock-based synchronization.

Barriers define points in the code where all program threads synchronize. Consequently, no data race can occur for memory accesses across barriers. In order to make DTXs cognizant of barrier semantics and avoid signaling races for conflicting accesses separated by barriers, we reset the state of all DTXs at program barriers (in addition to pausing DTXs across barrier primitives). Conflict detection in TM systems relies on maintaining read/write-sets either in hardware or software. Resetting a DTX means that the read/write-sets used for race detection are cleared for all threads when they reach a barrier, thereby avoiding reporting false data races.

Addition of DTXs partitions the code into three categories: TM, DTM and NoTM. The response of a RaceTM system that monitors conflicts between these sections of the code is summarized in Table I. The behavior is different from that of a conventional transactional memory system only when one of the participating sections is a DTM. The strong isolation(1) property of transactional memory can guarantee that interactions of a TM region with DTM and NoTM do not lead to data races. However, to help with debugging transactional programs, RaceTM can easily flag the conflicts between TM and DTM sections. Discovering such conflicts can be of crucial importance, for example when the underlying transactional memory implementation does not support strong isolation. Finally, a conflict between two NoTM sections of code cannot be captured because no record of memory accesses is maintained for those regions.

(1) Strong isolation implies that transactions are atomic and isolated from all other threads whether or not they are in transactions.

TABLE I
SYSTEM RESPONSES TO MEMORY ACCESS CONFLICTS BETWEEN TWO SECTIONS OF CODE OF A MULTI-THREADED PROGRAM

Section 1 | Section 2 | System response
TM        | TM        | Rollback one of them
TM        | NoTM      | -
NoTM      | NoTM      | -
TM        | DTM       | Report potential bug
DTM       | DTM       | Report data race
DTM       | NoTM      | Report data race
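The response matrix of Table I can be written down directly as code; the following is only a restatement of the table under an assumed enum encoding.

typedef enum { NOTM, TM, DTM } section_t;

typedef enum {
    NO_ACTION,            /* "-" entries in Table I                 */
    ROLLBACK_ONE,         /* TM vs. TM                              */
    REPORT_POTENTIAL_BUG, /* TM vs. DTM                             */
    REPORT_DATA_RACE      /* DTM vs. DTM, DTM vs. NoTM              */
} response_t;

response_t conflict_response(section_t a, section_t b)
{
    if (a == TM && b == TM)                  return ROLLBACK_ONE;
    if ((a == TM && b == DTM) || (a == DTM && b == TM))
                                             return REPORT_POTENTIAL_BUG;
    if (a == DTM && (b == DTM || b == NOTM)) return REPORT_DATA_RACE;
    if (b == DTM && a == NOTM)               return REPORT_DATA_RACE;
    return NO_ACTION;                        /* TM-NoTM and NoTM-NoTM pairs  */
}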
B. Implementation Details

1) LogTM: We have built a RaceTM prototype on top of an existing (simulated) hardware HTM implementation, LogTM [17]. LogTM distinguishes itself from other transactional memory proposals by integrating TM semantics into a directory cache coherence protocol to enable eager conflict detection and fast commit (by performing eager versioning). Furthermore, LogTM uses software to handle aborts with very little performance penalty. Next, we provide a brief description of LogTM's version management and conflict detection techniques.

Version management: The eager version management employed by LogTM stores new values "in place" and logs old values separately. Each thread is allocated a cacheable virtual memory area for maintaining an undo log of old values. On every store within a transaction, the new value replaces the old value present in the cache. The cache line holding the old value and its address are appended to the log associated with the thread. To avoid redundant log updates, the updated cache line is marked with a write (W) bit (see Figure 4). As discussed next, this W bit is also useful for performing conflict detection between concurrent transactions. The advantage of eager versioning (replacing old values in the cache) is fast commits. Although aborts are costlier, [17] shows a net gain when commits are more common than aborts.

Conflict detection: Conflict detection is done eagerly and is handled at the coherence protocol level [17]. First, the requesting processor sends out a coherence request. Then, the directory responds or forwards the request to another processor. The responding directory or processor examines its local state to detect a conflict and acks (no conflict) or nacks (conflict) the request. Finally, the requesting processor resolves any conflict.

To enable detection, LogTM augments each cache line with a read (R) bit and a write (W) bit (see Figure 4). These bits are set when the cache lines are accessed during a transaction, and flash-cleared when the transaction commits or aborts. The detection logic (in the responding processor) flags a conflict when a line with its W bit set gets a remote read or write request, or a line with its R bit set gets a remote write request. When a cache line with its R/W bit set is replaced, the LogTM system sets a per-processor overflow bit and subsequently flags conflicts for every remote request forwarded to this processor. The directory state for the replaced cache line is left unchanged so that future requests get forwarded to the responding processor (this feature is called sticky-bit in [17]). Thus, LogTM conservatively detects conflicts with slightly augmented coherence hardware. As we will show later, RaceTM also uses cache line bits and exploits the coherence protocol to detect races. Conflicts are resolved by aborting a transaction (when the conflict is between two transactions) or by requiring the processor executing a NoTM section to wait and retry (when the conflict is between a transaction and a NoTM section). LogTM handles aborts using software; in RaceTM, we leverage this feature for reporting data races.
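The detection rule just described (a W line conflicts with any remote request; an R line conflicts only with a remote write) can be summarized as follows; the struct layout and function name are assumptions for illustration, not LogTM's actual implementation.

#include <stdbool.h>

struct line_state {
    bool r;   /* line was read inside an ongoing transaction    */
    bool w;   /* line was written inside an ongoing transaction */
};

/* Conflict check performed by the responding processor for one cache line;  */
/* returning true corresponds to nacking the coherence request.              */
bool logtm_conflict(const struct line_state *ls, bool remote_is_write)
{
    if (ls->w)                    return true;   /* W set: any remote request conflicts  */
    if (ls->r && remote_is_write) return true;   /* R set: only a remote write conflicts */
    return false;
}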
2) RaceTM: To keep hardware costs low, we implemented the RaceTM prototype as minimal extensions to the base LogTM coherence protocol. Conceptually, the RaceTM protocol helps in detecting data races as conflicts involving at least one DTX, captured by the existing LogTM hardware support. Note that DTXs need no rollback support from LogTM, i.e., they incur no checkpointing and logging overheads.

The architectural state added for RaceTM consists of a processor mode bit (DTM mode) and debug read (DR) / debug write (DW) bits for every cache line (see Figure 4). The DTM mode bit is set while the processor runs a debug transaction. The DR and DW bits of a cache line are set when the line is accessed for a read and a write, respectively, while the DTM mode bit is set. Recall that RaceTM divides a program into TM, DTM and NoTM regions. As a processor executes a DTM region of code, its DTM mode bit is set and all memory accesses set DR/DW bits in the respective cache lines. Memory access conflicts between different categories of code are tracked using the LogTM conflict detection mechanism. Essentially, the same technique is used, but on a larger set of bits (R, W, DR and DW). A subset of these conflicts qualify as data races, in particular a conflict between two DTMs and a conflict between a DTM and NoTM (see Table I). The coherence protocol messages are tagged with a field that specifies the code category executing on the requesting processor at the time of the access. RaceTM adds no extra messages to the base coherence protocol.

Fig. 4. Architectural additions for LogTM and RaceTM. The additions required by LogTM are the register checkpoint, the LogBase and LogPtr registers, the TM mode bit, and the per-line R/W bits in the data caches; our RaceTM extensions are the DTM mode bit and the per-line DR/DW bits.

The data race detection proceeds as follows. The requesting processor sends a coherence request tagged with its code category. The directory then responds or forwards the request to another processor. The responding processor examines its local state to detect a conflict. A data race is reported in the following scenarios: (a) a remote read or write request arrives from a processor running DTM or NoTM code, and the responding processor has the DW bit set; or (b) a remote write request arrives from a processor running DTM or NoTM code, and the responding processor has the DR bit set. Figure 3 shows the example from Section IV-A and how RaceTM detects the data race.
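Scenarios (a) and (b) amount to the following check on the responding side; the encoding of the requester's code category and the struct layout are assumptions made for illustration, not the actual RaceTM hardware.

#include <stdbool.h>

typedef enum { REQ_NOTM, REQ_TM, REQ_DTM } req_category_t;

struct racetm_line {
    bool r, w;     /* LogTM transactional read/write bits */
    bool dr, dw;   /* RaceTM debug read/write bits        */
};

/* Returns true if the incoming request, tagged with the requester's code   */
/* category, together with the local DR/DW state constitutes a data race.   */
bool racetm_race(const struct racetm_line *l, req_category_t req, bool remote_is_write)
{
    bool requester_outside_tm = (req == REQ_DTM || req == REQ_NOTM);
    if (!requester_outside_tm)    return false;  /* TM-DTM conflicts are flagged as potential bugs only */
    if (l->dw)                    return true;   /* (a) remote read or write hits a DW line              */
    if (l->dr && remote_is_write) return true;   /* (b) remote write hits a DR line                      */
    return false;
}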
The eviction of a cache line with either of its DR/DW bits set could result in false negatives, i.e., missing a dynamic data race instance. These events would affect mainly long-distance races, which tend to be rare [22]. Moreover, they will not affect races that occur within close temporal proximity; these are the truly dangerous races since they can actually lead to arbitrary program behavior, e.g., due to randomness in scheduling [22]. Even if RaceTM does miss a dynamic instance of a data race, it is likely that the program bug will eventually be exposed by future data race instances. For these reasons, no special support was added to compensate for evicted cache lines.

RaceTM provides the following primitives:
• dtx_begin: Starts the DTX: sets the DTM mode bit in the processor. Usually used at the start of a program.
• dtx_end: Stops the DTX: clears the DR/DW bits and the DTM mode bit. Usually used at the end of a program.
• dtx_pause: Pauses the DTX: clears the DTM mode bit but keeps the DR/DW bits intact. Typically used to skip sections of code that do not need to be covered by a DTX, such as transactions, lock-protected code, etc.
• dtx_resume: Resumes a paused DTX: sets the DTM mode bit in the processor.
• dtx_clear: Clears the DR/DW bits to reset the DTX state. This is accomplished by "flash" clearing these bits in the cache, similar to [17]. In our simulator, dtx_clear is modeled as a single instruction. Used when all threads synchronize, e.g., at a barrier.
Fig. 3. RaceTM in action: This figure shows two parallel threads (top left) running using RaceTM. The number marked next to the code shows the time step at which that portion gets executed. Each grid box pair represents the system state at a given time step. The system state consists of processor’s TM and DTM mode bits and R, W, DR, DW bits for two cache lines. (0) System is in the initial state, no bit is set for either of the cache lines and both threads are running debug transactions. (1) Thread 1 sends a read request for address Y, and DR bit gets set. (2) Thread 1 pauses the debug transaction, begins a transaction, and sends an exclusive request for address X; Thread 2 pauses the debug transaction. (3) Thread 1 commits the transaction - consequently the R bit for address X gets cleared - and resumes the debug transaction; Thread 2 begins a transaction. (4) Thread 2 sends an exclusive request for address X, commits the transaction and resumes the debug transaction. (5) Thread 2 sends an exclusive request for address Y, responding processor examines local state and finds the DR bit set for address Y. This is a conflict that qualifies as a data race: Thread 1 responds with a special RACE message; Thread 2 can raise an exception and log the race in software.


Inserting calls to these primitives into a program is straightforward using the compiler pre-processor. For coarse-grained debugging, dtx_begin and dtx_end can easily encapsulate functions that can potentially run in parallel (i.e., functions that are assigned to threads). Calls to dtx_pause and dtx_resume can be used to bracket regular transactions, lock-protected regions, or any region of code that does not need to be covered by a DTX. dtx_clear is called from a barrier to clear the DTX state, as described in Section IV-A. A sketch of such instrumentation follows.
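Under the instrumentation strategy just described, a pre-processor (or a set of simple wrappers) could emit calls such as the following; the wrapper names are hypothetical, the dtx_* C bindings are assumed, and the exact placement of pause/clear/resume around locks and barriers is one possible reading of the text rather than the prototype's actual implementation.

#include <pthread.h>

/* Assumed C bindings for the RaceTM primitives of Section IV-B2. */
extern void dtx_begin(void); extern void dtx_end(void);
extern void dtx_pause(void); extern void dtx_resume(void);
extern void dtx_clear(void);

/* Coarse-grained debugging: cover an entire thread function with a DTX. */
static void *racetm_thread_start(void *(*fn)(void *), void *arg)
{
    dtx_begin();
    void *ret = fn(arg);
    dtx_end();
    return ret;
}

/* Lock-protected region: pause the DTX so its accesses are not flagged. */
static void racetm_mutex_lock(pthread_mutex_t *m)
{
    dtx_pause();
    pthread_mutex_lock(m);
}

static void racetm_mutex_unlock(pthread_mutex_t *m)
{
    pthread_mutex_unlock(m);
    dtx_resume();
}

/* Barrier: pause across the primitive and reset the DTX state, since */
/* accesses separated by the barrier cannot race.                     */
static void racetm_barrier_wait(pthread_barrier_t *b)
{
    dtx_pause();
    pthread_barrier_wait(b);
    dtx_clear();
    dtx_resume();
}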

Fig. 5. Trading off false positives and false negatives: the three lines represent the execution timelines of Thread 1, Thread 2 and Thread 3 after a fork. Red dots indicate shared memory accesses; the accesses a, b and c referred to in the text are among them.

Debugging support. RaceTM can provide feedback to a programmer in the form of race reports. Our prototype uses LogTM's software trap on conflict detection as a medium to report data races. Upon detecting a data race, RaceTM provides the conflict address (the address of the shared memory location triggering the data race), the instruction addresses (the program counter values of the threads involved when the race was detected) and the conflict type (e.g., a DTM read conflicting with a DW bit).

C. Limitations

Handling of fork/join. The current design of RaceTM does not provide an elegant way to handle arbitrary forks in the program. As an example, in Figure 5 the conflict between accesses a and b will be flagged as a data race by RaceTM. However, this is a false positive because the fork orders the two accesses. This situation can be avoided by adding a dtx_clear at the point of the fork, but then the data race between accesses a and c will be missed, resulting in a false negative. One possible solution to this problem could be the use of timestamps [13], [21] for debug transactions. Note that a common fork-join parallel computation pattern is for a master thread to fork and join all the other threads in the program. RaceTM can easily handle this case by treating the global fork/join as a barrier and clearing the DTX state (we use this feature in our benchmarks of Section V).

Implementation-related Limitations. LogTM uses cache lines to store conflict detection metadata. The effects of this on RaceTM are:
• Thread migration: A thread cannot be migrated because its transactional state (including debug state) resides in the cache lines.
• False sharing conflicts: Multiple program variables may map to the same cache line, forcing the conflict detection bits (R, W, DR and DW) to be shared between them. This can lead to false positives in race detection, an effect that we quantify in Section V.
• False negatives from evictions: When a cache line is evicted, debug state is lost, potentially leading to false negatives in detecting conflicts. However, this is not a big problem in practice (see Section IV-B1).

The above implementation-related issues are addressed by a newer version of LogTM, LogTM-SE (Log-based Transactional Memory - Signature Edition) [30], which was not publicly available at the time of our implementation. LogTM-SE supports transaction virtualization by decoupling transactional state from caches and enabling software to manipulate it. It uses hashed signatures (Bloom filters) to store and track fine-grained read/write-sets instead of cache line R/W bits. This solution can also be adopted in implementing RaceTM, e.g., by maintaining separate signatures for debug transactional read/write-sets, and opens avenues for future and follow-up work.

V. EXPERIMENTAL RESULTS

In this section we present results of an evaluation of our LogTM+RaceTM implementation. We use a simulation framework similar to that of LogTM [17]. Table II summarizes the architectural parameters of our simulation on a 4-core multiprocessor system.

TABLE II
SYSTEM ARCHITECTURAL PARAMETERS

Processors: 4 cores, 1 GHz, single-issue, in-order, non-memory IPC=1
L1 Cache: 16 kB, 8-way split, 32B cache line, 1-cycle latency
L2 Cache: 4 MB, 8-way unified, 32B cache line, 12-cycle latency
Memory: 4 GB, 80-cycle latency
Directory: Full-bit vector sharer list; migratory sharing optimization; directory cache, 6-cycle latency
Interconnect: Hierarchical switch topology
Network: 14-cycle link latency

To evaluate RaceTM, we selected five applications (listed with their inputs in Table III) from the SPLASH-2 suite [29]. The original SPLASH-2 benchmarks had been modified by the LogTM authors to use transactions. We compiled them using the additional preprocessing steps discussed in Section IV-B2, in order to add calls to RaceTM primitives for enabling debug transactions. The goal of the evaluation is to quantify the performance overhead introduced by RaceTM and to determine how effective RaceTM is in exposing bugs that cause data races.

Fig. 6. Performance overhead of RaceTM compared to the execution of LogTM only (in %), for the benchmarks barnes, cholesky, ocean, raytrace and water-nsquared.

Performance. The performance overhead of using RaceTM over a baseline LogTM implementation is shown in Figure 6. The maximum performance overhead seen was less than 5%, on par with (and in some cases better than) other hardware-based schemes for dynamic data race detection [22], [32]. The overhead of our scheme comes from the addition of calls to the debug transaction primitives. Among these, dtx_clear is the dominant contributor to the runtime (dtx_clear is called for each instance of a barrier in the code). For a quick comparison of RaceTM with software-based schemes, we also ran Barnes under helgrind [1], a lockset-based software race detector, observing overheads in excess of 10x.

Bug detection. Without any alteration to the code, the SPLASH-2 benchmarks reported dozens of data races when run with RaceTM. None of these result in an incorrect program execution, and they have also been discussed by previous works [21], [22], [32]. We list these alarms for the native code under the column Original Code in Table III and divide them into two major categories.

Benign races are typically caused by user-coded synchronization (such as flags) and code optimizations that do not affect program correctness. Instances of the former are discussed at length in the literature on dynamic data race detection [22]. We exemplify the latter with an instance that RaceTM discovered in the Barnes benchmark. The main loop in Barnes 1) calculates forces between a set of particles, 2) synchronizes threads, 3) moves the particles, and repeats. Lack of synchronization between step 3 and step 1 makes it possible for some of the threads to use old locations for some of the particles in step 1 (not yet updated by other threads still executing in the previous iteration). This presumably has a negligible impact on the final result because errors within a step are amortized over many iterations of the loop, and threads synchronize once every iteration in step 2.

False positives are due to the false sharing of cache lines by program variables, as a result of using cache lines to store read/write-sets in LogTM (see Section IV-C).

Table III shows the number of benign data races and false positives seen in the SPLASH-2 benchmarks. The false positive results were obtained for a cache line size of 32 bytes. To our knowledge, the benign races are inherent bugs in the benchmarks that a programmer introduces intentionally (or unintentionally). The numbers of benign races (per benchmark) found by RaceTM are comparable to the ones presented by [32]. In general, a hardware-only race detection technique cannot discriminate and filter out benign races. On the other hand, false positives can be tackled, and their discussion follows next.

TABLE III
BENCHMARKS TAKEN FROM THE SPLASH-2 SUITE (RACE REPORTS ON ORIGINAL CODE AND INJECTED BUGS)

Application (input) | Benign Races | False Positives | Bugs Injected
Barnes (128 particles) | 10 | 2 | barrier removal (1), lock removal (2)
Cholesky (wr10.O) | 28 | 13 | lock removal (3)
Ocean (34 × 34 ocean) | 1 | 37 | barrier removal (2), lock removal (1)
Raytrace (teapot) | 3 | 2 | lock removal (3)
Water-Nsquared (125 molecules) | 0 | 1 | lock removal (2)

To evaluate our scheme with some actual concurrency bugs, we performed 14 bug injection experiments by adding synchronization bugs to the SPLASH-2 applications, as listed in the last column of Table III. The bugs were injected by either (i) removing a single instance of a lock, or (ii) removing a single instance of a barrier. RaceTM successfully detected all the injected bugs. Thus, the possibility of encountering false negatives due to cache evictions (see Section IV-C) does not end up affecting the effectiveness of RaceTM.

False positives. Table IV shows the sensitivity of the number of false positives in RaceTM with respect to the cache line size. It is evident from this table that the number of false positives can be drastically reduced by using a smaller cache line size. As expected, the number of false positives for a cache line size of 4 bytes (the smallest granularity of a variable in the code) is zero. However, a realistic multiprocessor system cannot be expected to have a 4-byte cache line, therefore a solution is needed to reduce the number of false positives in a real system. Fortunately, as mentioned in Section IV-C, RaceTM can exploit the newer, signature-based LogTM-SE [30] to completely eliminate this artifact.

TABLE IV
SENSITIVITY OF THE NUMBER OF FALSE POSITIVES DUE TO FALSE SHARING (BY CACHE LINE SIZE)

Application | 4B | 8B | 16B | 32B | 64B
barnes | 0 | 0 | 0 | 2 | 8
cholesky | 0 | 9 | 7 | 13 | 12
ocean | 0 | 0 | 14 | 37 | 48
raytrace | 0 | 0 | 0 | 2 | 2
water-nsquared | 0 | 0 | 0 | 1 | 4

VI. CONCLUSIONS

We have described RaceTM, a dynamic data race detection technique that exploits the hardware support for transactional memory likely to be present in future chip-level multiprocessors. We have proposed the concept of lightweight debug transactions that use the conflict detection mechanism of a (hardware or software) transactional memory system to perform data race detection. We have presented a hardware simulation prototype and evaluated it using synchronization bugs injected into SPLASH-2 benchmarks, showing that RaceTM has low overhead and is effective in uncovering the bugs by detecting data races with good accuracy.
REFERENCES

[1] Helgrind: a thread error detector. http://valgrind.org/docs/manual/hg-manual.html.
[2] C. S. Ananian, K. Asanovic, B. C. Kuszmaul, C. E. Leiserson, and S. Lie. Unbounded Transactional Memory. In Proc. 11th Symposium on High-Performance Computer Architecture, 2005.
[3] S. Chen, M. Kozuch, T. Strigkos, B. Falsafi, P. B. Gibbons, T. C. Mowry, V. Ramachandran, O. Ruwase, M. Ryan, and E. Vlachos. Flexible hardware acceleration for instruction-grain program monitoring. SIGARCH Comput. Archit. News, 36(3):377–388, 2008.
[4] P. Damron, A. Fedorova, Y. Lev, V. Luchangco, M. Moir, and D. Nussbaum. Hybrid Transactional Memory. In Architectural Support for Programming Languages and Operating Systems, 2006.
[5] C. Fidge. Logical Time in Distributed Computing Systems. Computer, 24(8):28–33, 1991.
[6] S. Gupta, F. Sultan, S. Cadambi, F. Ivančić, and M. Roetteler. RaceTM: Detecting Data Races Using Transactional Memory. In Symposium on Parallelism in Algorithms and Architectures, 2008. (Brief announcement).
[7] L. Hammond, V. Wong, M. Chen, B. D. Carlstrom, J. D. Davis, B. Hertzberg, M. K. Prabhu, H. Wijaya, C. Kozyrakis, and K. Olukotun. Transactional Memory Coherence and Consistency. In Proc. 31st International Symposium on Computer Architecture, Jun. 2004.
[8] T. Harris and K. Fraser. Language Support for Lightweight Transactions. In Proc. 18th Conference on Object-Oriented Programming, Systems, Languages, and Applications, 2003.
[9] M. Herlihy, V. Luchangco, M. Moir, and W. N. Scherer III. Software Transactional Memory for Dynamic-Sized Data Structures. In Symposium on Principles of Distributed Computing, July 2003.
[10] M. Herlihy and J. E. B. Moss. Transactional Memory: Architectural Support for Lock-Free Data Structures. In Proc. 20th International Symposium on Computer Architecture, May 1993.
[11] M. Hind. Pointer analysis: Haven't we solved this problem yet? In Workshop on Program Analysis For Software Tools and Engineering (PASTE), 2001.
[12] A. H. Karp and J.-F. C. Collard. Synchronization of Threads in a Multithreaded Computer Program. U.S. Patent Application Publication No. US 2005/0283789, Dec. 2005.
[13] L. Lamport. Time, Clocks, and the Ordering of Events in a Distributed System. Comm. ACM, 21(7):558–565, 1978.
[14] S. Lu, S. Park, E. Seo, and Y. Zhou. Learning from Mistakes — A Comprehensive Study on Real World Concurrency Bug Characteristics. In Proc. 13th Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), March 2008.
[15] F. Mattern. Virtual Time and Global States of Distributed Systems. In Proc. Intl. Workshop on Parallel and Distributed Algorithms, 1988.
[16] C. C. Minh, M. Trautmann, J. Chung, A. McDonald, N. Bronson, J. Casper, C. Kozyrakis, and K. Olukotun. An Effective Hybrid Transactional Memory System with Strong Isolation Guarantees. In Proc. 34th International Symposium on Computer Architecture, 2007.
[17] K. Moore, J. Bobba, M. Moravan, M. Hill, and D. Wood. LogTM: Log-based Transactional Memory. In Proc. 12th Symposium on High-Performance Computer Architecture, Feb. 2006.
[18] R. O'Callahan and J.-D. Choi. Hybrid dynamic data race detection. In Principles and Practice of Parallel Programming, 2003.
[19] D. Perkovic and P. Keleher. Online data-race detection via coherency guarantees. In Proc. 2nd Symp. on Operating Systems Design and Implementation (OSDI '96), 1996.
[20] E. Pozniansky and A. Schuster. Efficient on-the-fly data race detection in multithreaded C++ programs. In Proc. Parallel and Distributed Processing Symposium, Apr. 2003.
[21] M. Prvulovic. CORD: Cost-effective (and nearly overhead-free) Order-Recording and Data Race Detection. In Proc. 12th Symposium on High-Performance Computer Architecture, 2006.
[22] M. Prvulovic and J. Torrellas. ReEnact: Using Thread-Level Speculation Mechanisms to Debug Data Races in Multithreaded Codes. In Symposium on Computer Architecture, 2003.
[23] R. Rajwar, M. Herlihy, and K. Lai. Virtualizing Transactional Memory. In Proc. 32nd International Symposium on Computer Architecture, 2005.
[24] H. E. Ramadan, C. J. Rossbach, D. E. Porter, O. S. Hofmann, A. Bhandari, and E. Witchel. TxLinux: Using and Managing Transactional Memory in an Operating System. 2007.
[25] C. J. Rossbach, D. E. Porter, O. S. Hofmann, H. E. Ramadan, A. Bhandari, and E. Witchel. MetaTM/TxLinux: Transactional Memory For An Operating System. 2007.
[26] S. Savage, M. Burrows, G. Nelson, P. Sobalvarro, and T. Anderson. Eraser: A Dynamic Data Race Detector for Multithreaded Programs. ACM Trans. Comput. Syst., 15(4):391–411, 1997.
[27] N. Shavit and D. Touitou. Software Transactional Memory. In Proc. 14th Symposium on Principles of Distributed Computing, 1995.
[28] M. Tremblay and S. Chaudhry. A Third-Generation 65nm 16-Core 32-Thread Plus 32-Scout-Thread CMT SPARC Processor. In Intl. Solid-State Circuits Conference, 2008.
[29] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta. The SPLASH-2 Programs: Characterization and Methodological Considerations. In Symposium on Computer Architecture, 1995.
[30] L. Yen, J. Bobba, M. R. Marty, K. E. Moore, H. Volos, M. D. Hill, M. M. Swift, and D. A. Wood. LogTM-SE: Decoupling Hardware Transactional Memory from Caches. In Symposium on High Performance Computer Architecture, 2007.
[31] Y. Yu, T. Rodeheffer, and W. Chen. RaceTrack: Efficient Detection of Data Race Conditions via Adaptive Tracking. In Proc. 20th Symposium on Operating Systems Principles, Oct. 2005.
[32] P. Zhou, R. Teodorescu, and Y. Zhou. HARD: Hardware-Assisted Lockset-based Race Detection. In Symposium on High Performance Computer Architecture, pages 121–132, Feb. 2007.