PREDATOR: Predictive False Sharing Detection

Tongping Liu (School of Computer Science, University of Massachusetts Amherst) [email protected]
Chen Tian (Huawei US R&D Center) [email protected]
Ziang Hu (Huawei US R&D Center) [email protected]
Emery D. Berger (School of Computer Science, University of Massachusetts Amherst) [email protected]

Abstract

False sharing is a notorious problem for multithreaded applications that can drastically degrade both performance and scalability. Existing approaches can precisely identify the sources of false sharing, but only report false sharing actually observed during execution; they do not generalize across executions. Because false sharing is extremely sensitive to object layout, these detectors can easily miss false sharing problems that can arise due to slight differences in memory allocation order or object placement decisions by the compiler. In addition, they cannot predict the impact of false sharing on hardware with different cache line sizes.

This paper presents PREDATOR, a predictive software-based false sharing detector. PREDATOR generalizes from a single execution to precisely predict false sharing that is latent in the current execution. PREDATOR tracks accesses within a range that could lead to false sharing given different object placement. It also tracks accesses within virtual cache lines, contiguous memory ranges that span actual hardware cache lines, to predict sharing on hardware platforms with larger cache line sizes. For each, it reports the exact program location of predicted false sharing problems, ranked by their projected impact on performance. We evaluate PREDATOR across a range of benchmarks and actual applications. PREDATOR identifies problems undetectable with previous tools, including two previously-unknown false sharing problems, with no false positives. PREDATOR is able to immediately locate false sharing problems in MySQL and the Boost library that had eluded detection for years.

Categories and Subject Descriptors D.1.3 [Software]: Concurrent Programming–Parallel Programming; D.4.8 [Software]: Operating Systems–Performance

General Terms Performance, Measurement

Keywords False Sharing, Multi-threaded

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. PPoPP '14, February 15–19, 2014, Orlando, Florida, USA. Copyright © 2014 ACM 978-1-4503-2656-8/14/02...$15.00. http://dx.doi.org/10.1145/2555243.2555244

1. Introduction

While writing correct multithreaded programs is often challenging, making them scale can present even greater obstacles. Any contention can impair scalability or even cause applications to run slower as the number of threads increases.

False sharing is a particularly insidious form of contention. It occurs when two threads update logically-distinct objects that happen to reside on the same cache line. The resulting coherence traffic can degrade performance by an order of magnitude [4]. Unlike sources of contention like locks, false sharing is often invisible in the source code, making it difficult to find.

As cache lines have grown larger and multithreaded applications have become commonplace, false sharing has become an increasingly important problem. Performance degradation due to false sharing has been detected across the software stack, including inside the Linux kernel [5], the Java virtual machine [8], common libraries [19] and widely-used applications [20, 23].

Recent work on false sharing detection falls short in several dimensions. Some approaches introduce excessive performance overhead, making them impractical [9, 16, 26]. Most do not report false sharing precisely and accurately [9–11, 16, 24, 28], and some require special OS support or only work on a restricted class of applications [17, 21].

In addition, all of these systems share one key limitation: they can only report observed cases of false sharing. As Nanavati et al. point out, false sharing is sensitive to where objects are placed in cache lines and so can be affected by a wide range of factors [21]. For example, using the gcc compiler accidentally eliminates false sharing in the Phoenix linear_regression benchmark at certain optimization levels, while LLVM does not do so at any optimization level. A slightly different memory allocation sequence (or a different memory allocator) can reveal or hide false sharing, depending on where objects end up in memory; using a different hardware platform with different addressing or cache line sizes can have the same effect. All of this means that existing tools cannot root out potentially devastating cases of false sharing that could arise with different inputs, in different execution environments, and on different hardware platforms. This paper makes the following contributions:

• Predictive False Sharing Detection: This paper introduces predictive false sharing analysis, an approach that can predict potential false sharing that does not manifest in a given run but may appear—and greatly degrade application performance—in a slightly different execution environment. Predictive false sharing detection thus overcomes a key limitation of previous detection tools.

• A Practical and Effective Predictive False Sharing Detector: This paper presents PREDATOR, a prototype predictive false sharing detector that combines compiler-based instrumentation with a runtime system. PREDATOR not only detects but also predicts potential false sharing problems. PREDATOR operates with reasonable overhead (average: 6× performance, 2× memory). It is the first false sharing tool able to automatically and precisely uncover false sharing problems in real applications, including MySQL and the Boost library.

2. False Sharing Detection

We first describe PREDATOR's false sharing detection mechanism, which comprises both compiler and runtime system components. Section 3 then explains how PREDATOR predicts potential false sharing based on a single execution.

2.1 Overview

False sharing occurs when two threads simultaneously access logically independent data in the same cache line, where at least one of the accesses is a write. For the purposes of exposition, we assume that each thread runs on a distinct core with its own private cache.

We observe that if a thread writes a cache line after other threads have accessed the same cache line, this write operation most likely causes at least one cache invalidation. It is this invalidation traffic that leads to performance degradation due to false sharing. To identify the root cause of such traffic, PREDATOR tracks cache invalidations of all cache lines, and ranks the severity of performance degradation of any detected false sharing problems according to the number of cache invalidations.

To track cache invalidations, PREDATOR relies on compiler instrumentation to track accesses to memory. While a compiler can easily identify read or write accesses, it cannot know how and when those instructions will be executed, since that depends on a specific execution, input, and runtime environment. Therefore, PREDATOR combines compiler instrumentation with a runtime system to track cache invalidations. The compiler instruments memory accesses with calls to the runtime system that notify it when an access occurs (see Section 2.2), and the runtime system collects and analyzes these accesses to detect and report false sharing (see Section 2.3).

2.2 Compiler Instrumentation

PREDATOR relies on LLVM to perform instrumentation at the intermediate representation level [15]. It traverses all functions one by one and searches for memory accesses to global and heap variables. For each memory access, PREDATOR inserts a function call to invoke the runtime system with the memory access address and access type (read or write). PREDATOR currently omits accesses to stack variables by default because stack variables are normally used for local storage and therefore do not normally introduce false sharing. However, instrumentation on stack variables can always be turned on if desired.

The instrumentation pass is placed at the very end of the LLVM optimization passes so that only those memory accesses surviving all previous LLVM optimization passes are instrumented. This technique, which can drastically reduce the number of instrumentation calls, is similar to the one used by AddressSanitizer [27].

2.3 Runtime System

PREDATOR's runtime system collects every memory access via the function calls inserted by the compiler's instrumentation phase. It analyzes possible cache invalidations due to possibly interleaved reads and writes. Finally, PREDATOR precisely reports any performance-degrading false sharing problems it finds. For global variables involved in false sharing, PREDATOR reports their name, address and size; for heap objects, PREDATOR reports the callsite stack for their allocations, their address and size. In addition, PREDATOR provides word-granularity access information for those cache lines involved in false sharing, including which threads accessed which words. This information can further help users diagnose and fix false sharing instances.

2.3.1 Tracking Cache Invalidations

PREDATOR only reports those global variables or heap objects on cache lines with a large number of cache invalidations. It is critical that PREDATOR track cache invalidations precisely in order to provide accurate reports of the location of false sharing instances. PREDATOR achieves this goal by maintaining a two-entry cache history table for every cache line. In this table, each entry has two fields: the thread ID and the access type (read or write). The thread ID is used to identify the origin of each access. As stated earlier, only accesses from different threads can cause cache invalidations.

For every new access to a cache line L, PREDATOR checks L's history table T to decide whether there is a cache invalidation based on the following rules. Note that table T has only two statuses: full and not full. There is no "empty" status, since every cache invalidation replaces the table's contents with the current write access.

• For each read access R:
  - If T is full, there is no need to record this read access.
  - If T is not full and an existing entry has a different thread ID, then PREDATOR records this read and its thread by adding a new entry to the table.

• For each write access W:
  - If T is full, then W causes a cache invalidation, since at least one of the two existing entries has a different thread ID. After recording this invalidation, PREDATOR updates the existing entry with W and its thread.
  - If T is not full, PREDATOR checks whether W and the existing entry have the same thread ID. If so, W cannot cause a cache invalidation, so PREDATOR updates the existing entry with W. Otherwise, PREDATOR identifies an invalidation on this line caused by W. After recording this invalidation information, PREDATOR updates the existing entry with W and its thread.

2.3.2 Reporting False Sharing

Once cache lines with many cache invalidations have been detected, PREDATOR needs to perform further analysis to differentiate actual false sharing from true sharing. True sharing, e.g., multiple threads updating the same counter in a cache line, can also cause many cache invalidations. In order to report false sharing precisely and accurately, PREDATOR employs the following mechanisms:

Distinguishing False from True Sharing. PREDATOR keeps track of access information for each word on those cache lines involved in false sharing: how many reads or writes to each word by which thread. When a word is accessed by multiple threads, PREDATOR marks the origin of this word as a shared access and does not track threads for further accesses to it. This approach lets PREDATOR accurately distinguish false sharing from true sharing in the reporting phase. It also helps diagnose where actual false sharing occurs when there are multiple fields or multiple objects in the same cache line, which can greatly reduce the manual effort required to fix the false sharing problems.

Callsite Tracking for Heap Objects. In order to precisely report the origins of heap objects with false sharing problems, PREDATOR maintains detailed information so it can report source-code-level information for each heap object. To obtain callsite information, PREDATOR intercepts all memory allocations and de-allocations, and relies on the backtrace() function in the glibc library to obtain the whole callsite stack. PREDATOR also avoids pseudo false sharing (false positives) caused by memory reuse because it updates recording information at memory de-allocation for those objects without false sharing problems; heap objects involved in false sharing are never reused.

Optimizing Metadata Lookup. For every access, PREDATOR needs to look up the corresponding cache line's metadata in order to store detailed information or update access counters. Because this operation is so frequent, lookups need to be very efficient. Like AddressSanitizer [27] and other systems [22, 28], PREDATOR uses a shadow memory mechanism to store metadata for every piece of application data. Thus, PREDATOR can compute and locate the corresponding metadata directly via address arithmetic.

Custom Memory Allocation. In order to efficiently support shadow memory, PREDATOR uses a predefined starting address and fixed size for its heap. It also contains a custom memory allocator, which is built with Heap Layers [2] using a "per-thread-heap" mechanism similar to that used by Hoard [1]. In this allocator, memory allocations from different threads never occupy the same physical cache line, which automatically prevents false sharing among different objects. However, using this custom memory allocator implies that false sharing caused by a memory allocator cannot be detected by PREDATOR. It is straightforward to solve such false sharing problems by using an allocator like Hoard that avoids this kind of false sharing.

2.4 Optimizations

Tracking every memory access can be extremely expensive. PREDATOR utilizes the following mechanisms to further reduce overhead.

2.4.1 Threshold-Based Tracking Mechanism

PREDATOR aims to detect false sharing that significantly degrades performance. Since cache invalidations are the root cause of performance degradation and only writes can possibly introduce cache invalidations, cache lines with a small number of writes are never a significant performance bottleneck. For this reason, PREDATOR only tracks cache invalidations once the number of writes to a cache line crosses a predefined threshold, which we refer to as the TrackingThreshold. Until this threshold is reached, PREDATOR only tracks the number of writes to a cache line while skipping tracking for reads. This mechanism reduces runtime and memory overhead at the same time.

PREDATOR maintains two arrays in shadow memory: CacheWrites tracks the number of memory writes to every cache line, and CacheTracking tracks detailed information for each cache line once the number of writes on that line exceeds the TrackingThreshold. If the threshold is not reached, there is no need to check the corresponding CacheTracking entry.

To avoid expensive lock operations, PREDATOR uses an atomic instruction to increment the CacheWrites counter for each cache line. Once the number of writes to a cache line reaches the predefined threshold, PREDATOR allocates space to track detailed cache invalidations and word accesses. PREDATOR also uses an atomic compare-and-swap to set the cache tracking address for this cache line in the shadow mapping. After CacheWrites on a cache line has crossed the TrackingThreshold, PREDATOR tracks all read and write accesses to this cache line.

void HandleAccess(unsigned long addr, bool isWrite) {
  unsigned long cacheIndex = addr >> CACHELINE_SIZE_SHIFTS;
  CacheTrack *track = NULL;
  if (CacheWrites[cacheIndex] < TRACKING_THRESHOLD) {
    if (isWrite) {
      if (ATOMIC_INCR(&CacheWrites[cacheIndex]) >= TRACKING_THRESHOLD) {
        track = allocCacheTrack();
        ATOMIC_CAS(&CacheTracking[cacheIndex], 0, track);
      }
    }
  } else {
    track = CacheTracking[cacheIndex];
    if (track) {
      // Track cache invalidations and detailed accesses
      track->handleAccess(addr, isWrite);
    }
  }
}

Figure 1. Pseudo-code for PREDATOR's memory access instrumentation.

[Figure 2 chart: "Object Alignment Sensitivity" — y-axis: Runtime (Seconds), 0 to 6; x-axis: Offset=0 through Offset=56 in 8-byte steps.]

Figure 2. Performance of the linear_regression benchmark from the Phoenix benchmark suite. Performance is highly sensitive to the offset of the starting address of the (potentially) falsely-shared object from the start of the cache line.

2.4.2 Selective Compiler Instrumentation

PREDATOR relies on instrumentation to provide memory access information to the runtime system and detects false sharing based on the sequences of memory accesses to every cache line. The performance overhead of doing this is proportional to the degree of instrumentation: more instrumentation means higher performance overhead. PREDATOR's design makes it possible to trade performance for accuracy as needed.

Currently, PREDATOR only adds instrumentation once for each type of memory access on each address in the same basic block. This selective instrumentation does not normally affect the effectiveness of detection. Because PREDATOR aims to detect cases of false sharing with many cache invalidations, less tracking inside a basic block can induce fewer cache invalidations, but it does not affect the overall behavior of cache invalidations.

To further improve performance, PREDATOR could easily be extended to support more flexible instrumentation:

• PREDATOR could selectively instrument both reads and writes, or only writes. Instrumenting only writes reduces overhead while still detecting write-write false sharing, as SHERIFF does [17].

• PREDATOR could be set to instrument or skip specific code or data. For example, the user could provide a blacklist so that given modules, functions or variables are not instrumented. Conversely, the user could provide a whitelist so that only specified functions or variables are instrumented.

2.4.3 Sampling Mechanism

As Section 2.4.1 describes, once the number of writes to a cache line exceeds the TrackingThreshold, every access must be tracked to store details such as word access information, the access count, and the cache access history table of this cache line. When a cache line is involved in false or true sharing, updating those counters can exacerbate the impact of sharing on performance: not only is there an invalidation on an application cache line, but there is also at least one more cache invalidation caused by updating the metadata of the corresponding cache lines. To further reduce performance overhead, PREDATOR samples only the first specified number of accesses in each sampling interval for problematic cache lines. Currently, PREDATOR maintains an access counter for each cache line and only tracks the first 10,000 out of every 1 million accesses to a cache line (a 1% sampling rate).

3. False Sharing Prediction

This section further motivates predictive false sharing and explains how to support it in the runtime system.

3.1 Overview

False sharing can depend on the alignment of objects and corresponding cache lines. Figure 2 demonstrates the impact of placement on linear_regression, a benchmark from the Phoenix benchmark suite. For this benchmark, when the offset of the starting address between the potentially falsely-shared object and the corresponding cache lines is 0 or 56 bytes, there is no false sharing. When the offset is 24 bytes, we see the most severe performance degradation caused by false sharing. The performance difference between these two scenarios can be as great as 15×.

Existing detection tools only report observed false sharing. In this case, they would miss a severe false sharing problem that could occur in the wild if the offset of the starting address was 0 bytes or 56 bytes in their test environment. PREDATOR overcomes this shortcoming by accurately predicting potential false sharing: false sharing that does not manifest in the current execution but may appear, and greatly affect a program's performance, in a slightly different environment.

Figure 3 presents a simplified overview of how false sharing can be triggered in different environments. In this figure, two rectangles with different patterns represent two portions of the same object, updated by different threads. In Figure 3(a), there is no false sharing when thread T1 only updates cache line 1 and T2 only updates cache line 2. However, false sharing appears in each of the following cases, even with the same access pattern:

• Doubling the cache line size (Figure 3(b)). When the size of a cache line doubles, both T1 and T2 access the same cache line, leading to false sharing.

• Different object starting addresses (Figure 3(c)). If the starting address of the object is not aligned with the starting address of the first cache line, T1 and T2 can update the second cache line simultaneously, causing false sharing.

PREDATOR predicts whether programs can have potential false sharing in either of these two scenarios. These scenarios capture the impact of any change in the execution environment, such as a different hardware platform or a different memory allocation sequence.

3.2 Basic Prediction Workflow

PREDATOR focuses exclusively on potential false sharing that can cause performance problems. Its implementation is based on two key observations. First, only accesses to adjacent cache lines can lead to potential false sharing: that is, they introduce cache invalidations when the cache line size or an object's starting address changes. Second, false sharing can degrade performance only when it introduces a large number of cache invalidations.

Based on these two observations, PREDATOR employs the following workflow to detect potential false sharing. Note that the detection optimizations listed in Section 2.4 apply directly to prediction as well.

1. Track the number of writes to different cache lines.

2. When the number of writes to a cache line L reaches the TrackingThreshold, track detailed read and write accesses for every word in both cache line L and its adjacent cache lines.

3. When the number of writes to a cache line L crosses a second threshold (the PredictionThreshold), identify whether there exists false sharing in L and its adjacent cache lines by analyzing the word access information collected in Step 2. Section 3.3 describes this step.

4. If potential false sharing is found, continue to track cache line invalidations to confirm it. Section 3.4 discusses the details.

3.3 Searching for Potential False Sharing

To predict potential false sharing in the cases when either the hardware cache line size doubles or object placement changes, we first introduce the concept of a virtual cache line. A virtual cache line is a contiguous memory range that spans one or more physical cache lines.

Using virtual cache lines lets PREDATOR predict potential false sharing in both of the scenarios mentioned above. When the hardware cache line size doubles, a virtual line is composed of two contiguous original cache lines, where the first cache line has an even index number. Thus, only cache lines 2*i and 2*i+1 can form a virtual line. To predict false sharing due to different starting addresses, a virtual line can have the same size as a physical line but can be positioned arbitrarily: unlike actual cache lines, the starting address of a virtual cache line does not need to be a multiple of the cache line size. For instance, a 64-byte virtual line can consist of the range [0,64) bytes or [8,72) bytes.

To search for potential false sharing problems, PREDATOR searches for a hot access pair in line L and its adjacent cache lines by analyzing the detailed word access information collected in Step 2. A hot access in a cache line refers to a word whose number of read or write accesses is larger than the average number of accesses to each word of cache line L. For every hot access X in cache line L, PREDATOR searches for another hot access Y in L's previous or next cache line satisfying the following conditions: (1) X and Y reside in the same virtual line; (2) at least one of X or Y is a write access; and (3) X and Y are issued by different threads.

Whenever it finds such a pair X and Y, PREDATOR identifies potential performance-degrading false sharing whenever the number of cache invalidations caused by X and Y, in a possible virtual line, is greater than the average number of accesses to each word of L. This approach is based on the same observation as in detection: if a thread writes a virtual line after other threads have accessed the same virtual line, this write operation most likely causes at least one cache invalidation. PREDATOR conservatively assumes that accesses from different threads occur in an interleaved manner; that is, it assumes that the schedule exposes false sharing. This approach ensures that PREDATOR does not miss any potential false sharing cases.

After identifying possible false sharing, PREDATOR goes to Step 4 to verify whether it is an actual false sharing problem.

3.4 Verifying Potential False Sharing

PREDATOR verifies potential false sharing by tracking cache invalidations of a problematic virtual line.
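Verification reuses the invalidation rules of Section 2.3.1, applied to the problematic virtual line rather than a physical one. The following is a minimal sketch of those two-entry history-table rules; the types and names are our own illustration, not PREDATOR's actual code:

```c
#include <stdbool.h>

/* Illustrative sketch of the two-entry access history kept per
 * (virtual) cache line; names are ours, not PREDATOR's. */
typedef struct { int thread; bool is_write; } Entry;
typedef struct { Entry e[2]; int n; long invalidations; } History;

/* Record one access; returns true if it causes a cache invalidation. */
bool record_access(History *h, int thread, bool is_write) {
    if (!is_write) {                        /* read access R */
        if (h->n < 2 && (h->n == 0 || h->e[0].thread != thread)) {
            h->e[h->n].thread = thread;     /* record the new reader */
            h->e[h->n].is_write = false;
            h->n++;
        }
        return false;                       /* reads never invalidate */
    }
    /* write access W */
    if (h->n == 2 || (h->n == 1 && h->e[0].thread != thread)) {
        h->invalidations++;                 /* another thread was present */
        h->e[0] = (Entry){ thread, true };  /* table now holds only W */
        h->n = 1;
        return true;
    }
    h->e[0] = (Entry){ thread, true };      /* same thread, or first access */
    h->n = 1;
    return false;
}
```

For example, a write by thread 1, a read by thread 2, and a second write by thread 1 yields exactly one invalidation, mirroring the read/write rules above.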

Figure 3. False sharing under different scenarios (see Section 3.1).
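The two scenarios of Figure 3 reduce to simple address arithmetic on virtual lines (Section 3.3). A sketch of the pair test, assuming 64-byte physical cache lines (the types and function names are illustrative, not PREDATOR's):

```c
#include <stdbool.h>

#define LINE_SIZE 64UL  /* assumed physical cache line size */

/* Hypothetical record of a hot word access. */
typedef struct { unsigned long addr; int thread; bool is_write; } HotAccess;

/* Doubled-line scenario (Figure 3(b)): physical lines 2*i and 2*i+1 form
 * one virtual line, so two addresses share a virtual line iff they fall
 * in the same 128-byte aligned block. */
bool same_doubled_virtual_line(unsigned long a, unsigned long b) {
    return a / (2 * LINE_SIZE) == b / (2 * LINE_SIZE);
}

/* Placement scenario (Figure 3(c)): a virtual line is a LINE_SIZE-byte
 * window at an arbitrary offset, so two addresses can share one iff
 * they are less than LINE_SIZE bytes apart. */
bool can_share_virtual_line(unsigned long a, unsigned long b) {
    unsigned long d = a > b ? a - b : b - a;
    return d < LINE_SIZE;
}

/* Conditions (1)-(3) of Section 3.3 for a potential false sharing pair. */
bool potential_pair(HotAccess x, HotAccess y) {
    return can_share_virtual_line(x.addr, y.addr)
        && (x.is_write || y.is_write)
        && x.thread != y.thread;
}
```

Note that addresses 60 and 70 lie on different 64-byte physical lines, yet a virtual line such as [8,72) covers both, so writes by different threads to those words form a potential pair.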

[Figure 4 diagram: hot accesses X and Y, a distance d apart, span Cache Line 1 and Cache Line 2; the tracked virtual line leaves (sz-d)/2 bytes before X and after Y; other candidate virtual lines are not tracked.]

Figure 4. Determining a virtual line with size sz according to hot accesses (see Section 3.4).

For potential false sharing caused by doubling the cache line size, as described in Section 3.3, a virtual line is always composed of the cache lines with indices 2*i and 2*i+1. PREDATOR tracks cache invalidations on the virtual line on which false sharing has been discovered.

However, for the case of a change in starting address, two hot accesses with a distance less than the cache line size can form multiple virtual lines. There is thus an additional step required to determine which virtual line needs to be tracked. Given two words with the hot accesses shown in Figure 4, PREDATOR leaves the same amount of space before X and after Y in determining a virtual line. That is, the virtual line starting at location X - ((sz - d)/2) and ending at Y + ((sz - d)/2) is tracked. This choice allows tracking more possible cache invalidations caused by accesses adjacent to X and Y. Since adjusting the starting address of a virtual line has the same effect as adjusting the starting address of an object in detecting false sharing, all cache lines related to the same object must be adjusted at the same time. PREDATOR then tracks cache invalidations based on these adjusted virtual lines.

4. Experimental Evaluation

This section answers the following questions:

• How effective is PREDATOR at detecting and predicting false sharing (§ 4.1)?
• What is PREDATOR's overhead, in terms of execution time (§ 4.2) and memory (§ 4.3)?
• How sensitive is PREDATOR to different sampling rates (§ 4.4)?

Experimental Platform. All evaluations are performed on a quiescent Intel Core 2 dual-processor system equipped with 16GB RAM. Each processor is a 4-core 64-bit Intel Xeon running at 2.33 GHz, with a 4MB shared L2 cache and 32KB private L1 caches. The underlying operating system is an unmodified CentOS 5.5, running Linux kernel version 2.6.18-194.17.1.el5. We use glibc version 2.5 and LLVM version 3.2. All applications were compiled as 64-bit executables with the optimization level set to -O1 in order to maintain accurate source code line number information.

Evaluated Applications. This paper evaluates two popular benchmark suites, Phoenix (with large input) [25] and PARSEC (with simlarge input) [3]. We were unable to include two of the benchmarks: LLVM does not compile Facesim successfully, reporting an undefined template, and Canneal compiles but then aborts unexpectedly. We also evaluate PREDATOR on six real applications: MySQL, Boost, Memcached, aget, pbzip2 and pfscan.

4.1 Detection and Prediction Effectiveness

For every detected or predicted false sharing problem, PREDATOR reports source code information and detailed memory access information. Figure 5 shows an example for the linear_regression benchmark. This report shows that the heap object starting at 0x40000038 potentially causes numerous cache invalidations. The allocation callsite is provided to help locate the culprits. In addition, PREDATOR also reports word-level access information for this object, which makes it possible for the developer to identify where and how false sharing occurs. From this information, we can see that this instance is a latent false sharing problem predicted by PREDATOR, since different threads are accessing different hardware cache lines.

4.1.1 Benchmarks

Table 1 provides detection results across the Phoenix and PARSEC benchmark suites. The first column lists the programs with false sharing problems. The second column shows precisely where the problem is; because all discovered false sharing occurs inside heap objects, we present callsite source code information here. The third column, New, indicates whether this false sharing was newly discovered by PREDATOR. A checkmark in the following two columns indicates whether the false sharing was identified without prediction and/or with prediction. The final column, Improvement, presents the performance improvement after fixing the false sharing.

FALSE SHARING HEAP OBJECT: start 0x40000038 end 0x40000238 (with size 200).
Number of accesses: 5153102690; Number of invalidations: 175020; Number of writes: 13636004.

Callsite stack:
./stddefines.h:53
./linear_regression-pthread.c:133

Word level information:
......
Address 0x40000070 (line 16777217): reads 339508 writes 339507 by thread 1
Address 0x40000080 (line 16777218): reads 2716059 writes 0 by thread 2
......
Address 0x400000b0 (line 16777218): reads 339507 writes 339508 by thread 2
Address 0x400000c0 (line 16777219): reads 2716061 writes 0 by thread 3
Address 0x400000c8 (line 16777219): reads 339507 writes 0 by thread 3

Figure 5. An example report by PREDATOR indicating false sharing in the linear_regression benchmark.
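The line numbers in Figure 5 follow directly from the word addresses: with 64-byte cache lines, the line index is the address shifted right by six bits (the CACHELINE_SIZE_SHIFTS of Figure 1). Checking the report's values confirms that threads 1 and 2 touch different physical lines, which is why this instance is predicted rather than observed false sharing:

```c
/* Cache line index = address >> log2(line size); 64-byte lines give a
 * shift of 6, matching CACHELINE_SIZE_SHIFTS in Figure 1. */
enum { CACHELINE_SIZE_SHIFTS = 6 };

unsigned long line_index(unsigned long addr) {
    return addr >> CACHELINE_SIZE_SHIFTS;
}
```

For example, line_index(0x40000070) is 16777217 while line_index(0x40000080) is 16777218, exactly as the report shows.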

Benchmark           Source Code                        New   Without Prediction   With Prediction   Improvement
histogram           histogram-pthread.c:213            ✓     ✓                    ✓                 46.22%
linear_regression   linear_regression-pthread.c:133                               ✓                 1206.93%
reverse_index       reverseindex-pthread.c:511               ✓                    ✓                 0.09%
word_count          word_count-pthread.c:136                 ✓                    ✓                 0.14%
streamcluster       streamcluster.cpp:985                    ✓                    ✓                 7.52%
streamcluster       streamcluster.cpp:1907             ✓     ✓                    ✓                 4.77%

Table 1. False sharing problems in the Phoenix and PARSEC benchmark suites.
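The fixes behind Table 1's Improvement column are typically small; for histogram, for instance, padding the per-thread argument structure to a cache line boundary suffices. A sketch of that idiom (the field names are hypothetical, not histogram's actual ones):

```c
/* Pad a per-thread structure so each instance occupies its own
 * 64-byte cache line; the fields here are hypothetical. */
#define CACHE_LINE 64

typedef struct {
    unsigned long local_count;   /* thread-private hot field */
    char pad[CACHE_LINE - sizeof(unsigned long)];
} padded_arg_t;

/* With C11, alignment can be requested directly instead:
 *   typedef struct { _Alignas(64) unsigned long local_count; } padded_arg_t;
 */
```

An array of padded_arg_t then gives each thread a distinct cache line, assuming the array itself starts at a 64-byte aligned address.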

As the table shows, PREDATOR reveals two previously unknown false sharing problems. It is the first tool to detect false sharing problems in histogram and in line 1908 of streamcluster. In histogram, multiple threads simultaneously modify different locations of the same heap object, thread_arg_t. Padding this data structure eliminates the false sharing and improves performance by around 46%. In streamcluster, multiple threads simultaneously access and update the same bool array, switch_membership. Simply changing all elements of this array to a long type reduces the false sharing and improves performance by about 4.7%.

Other false sharing problems reported here were also discovered by previous work [17]. We do not see significant performance improvement for the reverse_index and word_count benchmarks. They are reported here because the number of cache invalidations in these two programs crosses our predefined threshold. Increasing PREDATOR's reporting threshold would avoid reporting these cases, which are relatively insignificant. Nonetheless, it is worth noting that these two benchmarks do indeed have false sharing problems, which can be confirmed by the word-level information generated by PREDATOR.

The linear_regression benchmark has an unusually severe false sharing problem. Fixing it improves performance by more than 12×. In this benchmark, different threads repeatedly update their thread-specific locations inside the tid_args object inside a tight loop. Interestingly, Nanavati et al. observe that this false sharing problem occurs when using clang and disappears when using gcc with the -O2 and -O3 optimization levels [21]. However, we observe a different result when using our version of clang and the custom memory allocator: the false sharing problem does not occur at all, because the offset between the starting address of the potentially falsely-shared object and the start of its cache line is 56 bytes (see Figure 2). As we discuss below, PREDATOR's prediction mechanism identifies this latent false sharing problem, highlighting the value of predictive detection.

The streamcluster benchmark has another false sharing problem, located at line 985. Different threads repeatedly update the work_mem object. The authors of streamcluster were clearly aware of this issue and provide a CACHE_LINE macro for padding. Unfortunately, the default value of this macro is set to 32 bytes, which is smaller than the actual cache line size of the experimental machine. Setting it to 64 bytes instead improves performance by about 7.5%.

4.1.2 Real Applications

We evaluate PREDATOR's effectiveness on several widely-used real applications: MySQL, a database server [20]; Boost, a standard C++ library [19]; Memcached, an object caching system; aget, a download accelerator; pbzip2, a parallel bzip2 file compressor; and pfscan, a parallel file scanner. MySQL-5.5.32 and boost-1.49.0 are known to have false sharing problems. The other applications we examine (memcached-1.4.15, aget-0.4.1 and pbzip2-1.1.6) do not have any known false sharing problems.

MySQL's false sharing problem caused a significant scalability problem and was very difficult to identify. According to the architect of MySQL, Mikael Ronstrom, "we had gathered specialists on InnoDB..., participants from MySQL support... and a number of generic specialists on computer performance...", "[we] were able to improve MySQL performance by 6× with those scalability fixes" [20]. The false sharing inside Boost is caused by the usage of a spinlock pool: different threads may utilize different spinlocks located in the same cache line. Fixing it brings a 40% performance improvement. PREDATOR is able to pinpoint the false sharing locations in both MySQL and the Boost library. For the other four applications, PREDATOR does not identify any severe false sharing problems.

4.1.3 Prediction Effectiveness

In this section, we describe in detail our experience with a particular benchmark that demonstrates the value of our approach. We use the linear_regression benchmark as a case study for two reasons: (1) the false sharing problem of this benchmark cannot be detected without prediction; and (2) false sharing severely degrades performance when it actually occurs. Hence, it is a serious problem that should always be detected.

Figure 6 shows the data structure and the source code experiencing false sharing:

```c
struct lreg_args {
    pthread_t tid;
    POINT_T *points;
    int num_elems;
    long long SX;
    long long SY;
    long long SXX;
    long long SYY;
    long long SXY;
};

void *lreg_thread(void *args_in) {
    struct lreg_args *args = args_in;
    for (int i = 0; i < args->num_elems; i++) {
        args->SX  += args->points[i].x;
        args->SXX += args->points[i].x * args->points[i].x;
        args->SY  += args->points[i].y;
        args->SYY += args->points[i].y * args->points[i].y;
        args->SXY += args->points[i].x * args->points[i].y;
    }
    return NULL;
}
```

Figure 6. The false sharing problem inside the linear_regression benchmark: multiple threads simultaneously update their entries in lreg_args.

The size of this data structure, lreg_args, is 64 bytes when the program is compiled to a 64-bit binary. For this benchmark, the main thread allocates an array containing as many elements as the number of underlying hardware cores. Each element is a 64-byte lreg_args. This array is then passed to different threads (the lreg_thread function) so that each thread only updates its thread-dependent area. False sharing occurs if two threads happen to update data in the same cache line.

Figure 2 shows how sensitive linear_regression's performance is to different starting addresses of a falsely-shared object. When the offset is 0 or 56 bytes, this benchmark achieves its optimal performance and has no false sharing. When the offset is 24 bytes, the benchmark runs around 15× slower because of false sharing.

4.2 Performance Overhead

Figure 7 presents the runtime overhead of using PREDATOR. All measurements are based on the average of 10 runs, excluding the maximum and minimum values. PREDATOR imposes an average of 5.4× performance overhead. There is no noticeable difference in performance whether the prediction mechanism is enabled or not.

Five benchmarks (histogram, kmeans, bodytrack, ferret, and swaptions) have more than 8× performance overhead. The histogram benchmark runs more than 26× slower because tracking detailed accesses to cache lines with false sharing exacerbates the false sharing effect (see Section 2.4.3). Although bodytrack and ferret have no false sharing, PREDATOR detects numerous cache lines with writes that exceed the TrackingThreshold, causing it to track detailed access information. We have not identified the exact cause of PREDATOR's high performance overhead for kmeans. As expected, PREDATOR imposes relatively little overhead for I/O-bound applications (matrix_multiply, blackscholes, x264, aget, Memcached, pbzip2, and pfscan).

4.3 Memory Overhead

Figures 9 and 8 present PREDATOR's relative and absolute memory overhead, respectively. We compute PREDATOR's physical memory consumption via the proportional set size (PSS) obtained from the /proc/self/smaps file [14]. We periodically collect this data and use the sum of all memory mappings as the total physical memory usage of the running application.

PREDATOR imposes less than 50% memory overhead for 17 out of 22 applications. For swaptions and aget, PREDATOR introduces high relative memory overhead because their original memory footprints are extraordinarily small: both have sub-megabyte footprints. MySQL's increase in memory consumption, from 132 MB to 512 MB, is due to PREDATOR's heap organization, which does not aggressively reclaim memory held by individual threads. In all cases where PREDATOR imposes substantial memory overhead, the applications continue to fit comfortably into RAM on modern platforms.

4.4 Sensitivity to Different Sampling Rates

Section 2.4.3 describes PREDATOR's sampling approach to reduce tracking overhead. This section evaluates the effect of different sampling rates on performance and effectiveness. Note that running an application with different sampling rates does not affect its memory usage.

The default sampling rate used by PREDATOR is 1%. To test PREDATOR's sensitivity to this choice, we evaluate performance on a representative subset of the benchmarks with two other sampling rates: 0.1% and 10%. Figure 10 presents the results. As expected, PREDATOR introduces lower performance overhead at lower sampling rates. Even when using the 0.1% sampling rate, PREDATOR is still able to detect all false sharing problems reported here, although it reports a lower number of cache invalidations.

Figure 7. Execution time overhead of PREDATOR with and without prediction (PREDATOR-NP).

Figure 8. Absolute physical memory usage overhead with PREDATOR.

Figure 9. Relative physical memory usage overhead with PREDATOR.

5. Discussion

5.1 Instrumentation Selection

Dynamic binary instrumentation and compiler-based instrumentation are two alternative approaches for performing instrumentation [12]. They exhibit different tradeoffs between performance and generality. Dynamic binary instrumentors, such as Valgrind [22], Pin [18], and DynamoRIO [6], typically analyze the program's code just before execution in order to insert instrumentation. They introduce significant performance overhead, mostly caused by run-time encoding and decoding, but the fact that they operate directly on binaries makes them extremely convenient. By contrast, compiler instrumentation inserts instrumentation during the compilation phase, which requires re-compilation of all source code. PREDATOR employs compiler-based instrumentation both because of its better performance and its greater flexibility, as discussed in Section 2.4.2.

5.2 Effectiveness

Several factors can affect PREDATOR's ability to identify false sharing.

Different Inputs. Different inputs trigger distinct executions of a program. If a specific input does not exercise the code with false sharing problems, PREDATOR cannot necessarily detect them. However, PREDATOR does generalize over inputs to find latent false sharing problems in the exercised code. When a reasonably representative set of inputs is exercised, as is required by any testing regime, PREDATOR can effectively predict false sharing.

Input Size. Input size may affect detection results. As discussed in Section 2.4, PREDATOR introduces several threshold values to reduce tracking overhead, which can be adjusted as needed. If the input size is so small that it cannot generate enough false sharing events to cross the predefined thresholds, then the detection mechanism will not be triggered. In such cases, PREDATOR will miss actual cases of false sharing. However, realistically large inputs should be enough to trigger PREDATOR's detection mechanisms. In our experience, running applications for at least 150 seconds is sufficient to expose false sharing problems.

Hardware Independence. PREDATOR's compiler-based approach makes it independent of the underlying hardware platform. This approach increases generality, but may lead it to over-report false sharing. PREDATOR conservatively assumes that different threads are running on different cores and detects false sharing problems based on possible cache invalidations. However, if multiple threads involved in false sharing are on the same core, then there will be no performance impact.

6. Future Work

We have identified several directions along which PREDATOR could be enhanced.

Use Across the Software Stack. PREDATOR's architecture should in principle let it detect and predict false sharing in the entire software stack, including hypervisors, operating systems, libraries, and applications using different threading libraries.

Improved Performance. PREDATOR currently imposes approximately 6× performance overhead. In the current implementation, every memory access is instrumented with a library call to notify the runtime system. A library call entails not only normal function call overhead but also Global Offset Table (GOT) and/or Procedure Linkage Table (PLT) overhead.

Figure 10. Sampling rate sensitivity (execution time).

7. Related Work

This section describes related work in detecting or preventing false sharing; no prior work predicts false sharing.

7.1 False Sharing Detection

Schindewolf et al. designed a tool based on the SIMICS functional simulator to report different kinds of cache usage information, such as cache misses and cache invalidations [26]. Pluto relies on the Valgrind dynamic instrumentation framework to track the sequence of memory read and write events on different threads, and reports a worst-case estimation of possible false sharing [9]. Similarly, Liu uses Pin to collect memory access information and reports total cache miss information [16]. These tools impose about 100-200× performance overhead.

Zhao et al. present a tool based on the DynamoRIO framework to detect false sharing and other cache contention problems in multithreaded programs [28]. It uses a shadow memory technique to maintain memory access history and detects cache invalidations based on the ownership of cache lines. However, it can only support at most 8 threads. In addition, it cannot differentiate cold cache misses from actual false sharing problems.

Intel's performance tuning utility (PTU) uses Precise Event Based Sampling (PEBS) hardware support to detect false sharing problems [10, 11]. PTU cannot distinguish true sharing from false sharing. In addition, PTU aggregates memory accesses without considering memory reuse and access interleavings, leading to numerous false positives.

Sanath et al. designed a machine-learning-based approach to detect false sharing problems. They train their classifier on mini-programs and apply it to general programs [13]. Instead of instrumenting memory accesses, this tool relies on hardware performance counters to collect memory access events. This approach operates with extremely low overhead, but ties false sharing detection to a specific hardware platform.

In addition to their individual disadvantages, all approaches discussed above share a common shortcoming: they cannot pinpoint the exact location of false sharing in the source code, so programmers must manually examine the source code to identify problems.

Pesterev et al. present DProf, a tool that helps programmers identify cache misses based on AMD's instruction-based sampling hardware [24]. DProf requires manual annotation to locate data types and object fields, and cannot detect false sharing when multiple objects reside on the same cache line.

7.2 False Sharing Prevention

Jeremiassen and Eggers use a compiler transformation to automatically adjust the memory layout of applications through padding and alignment. Chow et al. alter parallel loop scheduling in order to avoid false sharing [7]. These approaches only work for regular, array-based scientific code.

Berger et al. describe Hoard, a scalable memory allocator that can reduce the possibility of false sharing by making different threads use different heaps [1]. Hoard cannot avoid false sharing in global variables or within a single heap object: the latter appears to be the primary source of false sharing problems.

7.3 False Sharing Detection and Prevention

SHERIFF provides two tools to handle false sharing based on its "threads-as-processes" framework [17]. SHERIFF's detection tool reports false sharing accurately and precisely with only 20% performance overhead. However, it can only detect write-write false sharing, and only works for programs that use the pthreads library. It can also break programs that communicate across threads with stack variables or ad hoc synchronizations. These shortcomings limit SHERIFF's usefulness for real-world applications. PREDATOR can detect all kinds of false sharing and imposes no limitations on the kind of applications it works on.

SHERIFF's prevention tool prevents false sharing altogether, eliminating the need for programmer intervention. However, in programs with many synchronization calls, the overhead imposed by SHERIFF could lead to performance degradation.

Plastic leverages the sub-page granularity memory remapping facility provided by the Xen hypervisor to detect and tolerate false sharing automatically [21]. However, the sub-page memory remapping mechanism is not currently supported by most existing operating systems, reducing its generality. In addition, Plastic cannot pinpoint the exact source of false sharing. In order to utilize Plastic's prevention tool, a program has to run on the Xen hypervisor, limiting the applicability of their prevention technique.

8. Conclusion

This paper introduces predictive false sharing detection, and presents a prototype system called PREDATOR that performs this detection. By collecting and analyzing information through instrumented reads and writes, the runtime system detects false sharing based on cache invalidations and reports only those potentially causing severe performance degradation. PREDATOR predicts potential false sharing that could be caused by a change of hardware cache line size or of the starting addresses of objects. By identifying latent false sharing problems that can occur in the wild but are unobserved in the test environment, PREDATOR overcomes a key limitation of all previous false sharing detection approaches.

Our evaluation shows that PREDATOR can effectively detect and predict several previously unknown and existing false sharing problems in two popular benchmark suites, Phoenix and PARSEC. We also evaluate PREDATOR on six real applications. It successfully detects two known false sharing problems inside MySQL and the Boost library. Fixing these false sharing problems improves performance by 6× and 40%, respectively.

Acknowledgments

This material is based upon work supported by the National Science Foundation under Grant No. 1012195-CCF. The authors thank Junjie Gu for his assistance with LLVM. The authors also thank Charlie Curtsinger, Dimitar Gochev, John Altidor, and the anonymous reviewers for their helpful suggestions during the development of this work. Tongping Liu was supported by an internship while at Huawei US Research Center.

References

[1] E. D. Berger, K. S. McKinley, R. D. Blumofe, and P. R. Wilson. Hoard: A scalable memory allocator for multithreaded applications. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-IX), pages 117-128, Cambridge, MA, Nov. 2000.

[2] E. D. Berger, B. G. Zorn, and K. S. McKinley. Composing high-performance memory allocators. In Proceedings of the ACM SIGPLAN 2001 Conference on Programming Language Design and Implementation, PLDI '01, pages 114-124, New York, NY, USA, 2001. ACM.

[3] C. Bienia and K. Li. PARSEC 2.0: A new benchmark suite for chip-multiprocessors. In Proceedings of the 5th Annual Workshop on Modeling, Benchmarking and Simulation, June 2009.

[4] W. J. Bolosky and M. L. Scott. False sharing and its effect on shared memory performance. In SEDMS IV: USENIX Symposium on Experiences with Distributed and Multiprocessor Systems, pages 57-71, Berkeley, CA, USA, 1993. USENIX Association.

[5] S. Boyd-Wickizer, A. T. Clements, Y. Mao, A. Pesterev, M. F. Kaashoek, R. Morris, and N. Zeldovich. An analysis of Linux scalability to many cores. In Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation, OSDI '10, pages 1-8, Berkeley, CA, USA, 2010. USENIX Association.

[6] D. Bruening, T. Garnett, and S. Amarasinghe. An infrastructure for adaptive dynamic optimization. In Proceedings of the International Symposium on Code Generation and Optimization: Feedback-directed and Runtime Optimization, CGO '03, pages 265-275, Washington, DC, USA, 2003. IEEE Computer Society.

[7] J.-H. Chow and V. Sarkar. False sharing elimination by selection of runtime scheduling parameters. In ICPP '97: Proceedings of the International Conference on Parallel Processing, pages 396-403, Washington, DC, USA, 1997. IEEE Computer Society.

[8] David Dice. False sharing induced by card table marking. https://blogs.oracle.com/dave/entry/false_sharing_induced_by_card, February 2011.

[9] S. M. Günther and J. Weidendorfer. Assessing cache false sharing effects by dynamic binary instrumentation. In WBIA '09: Proceedings of the Workshop on Binary Instrumentation and Applications, pages 26-33, New York, NY, USA, 2009. ACM.

[10] Intel Corporation. Intel Performance Tuning Utility 3.2 Update, November 2008.

[11] Intel Corporation. Avoiding and identifying false sharing among threads. http://software.intel.com/en-us/articles/avoiding-and-identifying-false-sharing-among-threads/, February 2010.

[12] T. Iskhodzhanov, R. Kleckner, and E. Stepanov. Combining compile-time and run-time instrumentation for testing tools. Programmnye produkty i sistemy, 3:224-231, 2013.

[13] S. Jayasena, S. Amarasinghe, A. Abeyweera, G. Amarasinghe, H. De Silva, S. Rathnayake, X. Meng, and Y. Liu. Detection of false sharing using machine learning. In Proceedings of SC13: International Conference for High Performance Computing, Networking, Storage and Analysis, SC '13, pages 30:1-30:9, New York, NY, USA, 2013. ACM.

[14] Justin L. A way to determine a process's "real" memory usage, i.e. private dirty RSS? http://stackoverflow.com/questions/118307/a-way-to-determine-a-processs-real-memory-usage-i-e-private-dirty-rss, October 2011.

[15] C. Lattner and V. Adve. LLVM: A compilation framework for lifelong program analysis & transformation. In Proceedings of the International Symposium on Code Generation and Optimization: Feedback-directed and Runtime Optimization, CGO '04, pages 75-, Washington, DC, USA, 2004. IEEE Computer Society.

[16] C.-L. Liu. False sharing analysis for multithreaded programs. Master's thesis, National Chung Cheng University, July 2009.

[17] T. Liu and E. D. Berger. SHERIFF: Precise detection and automatic mitigation of false sharing. In Proceedings of the 2011 ACM International Conference on Object-Oriented Programming Systems Languages and Applications, OOPSLA '11, pages 3-18, New York, NY, USA, 2011. ACM.

[18] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood. Pin: Building customized program analysis tools with dynamic instrumentation. In Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '05, pages 190-200, New York, NY, USA, 2005. ACM.

[19] mcmcc. False sharing in boost::detail::spinlock pool? http://stackoverflow.com/questions/11037655/false-sharing-in-boostdetailspinlock-pool, June 2012.

[20] Mikael Ronstrom. MySQL team increases scalability by >50% in MySQL 5.6 labs release April 2012. http://mikaelronstrom.blogspot.com/2012/04/mysql-team-increases-scalability-by-50.html, April 2012.

[21] M. Nanavati, M. Spear, N. Taylor, S. Rajagopalan, D. T. Meyer, W. Aiello, and A. Warfield. Whose cache line is it anyway?: Operating system support for live detection and repair of false sharing. In Proceedings of the 8th ACM European Conference on Computer Systems, EuroSys '13, pages 141-154, New York, NY, USA, 2013. ACM.

[22] N. Nethercote and J. Seward. Valgrind: A framework for heavyweight dynamic binary instrumentation. In Proceedings of the 2007 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '07, pages 89-100, New York, NY, USA, 2007. ACM.

[23] K. Papadimitriou. Taming false sharing in parallel programs. Master's thesis, University of Edinburgh, 2009.

[24] A. Pesterev, N. Zeldovich, and R. T. Morris. Locating cache performance bottlenecks using data profiling. In EuroSys '10: Proceedings of the 5th European Conference on Computer Systems, pages 335-348, New York, NY, USA, 2010. ACM.

[25] C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, and C. Kozyrakis. Evaluating MapReduce for multi-core and multiprocessor systems. In HPCA '07: Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture, pages 13-24, Washington, DC, USA, 2007. IEEE Computer Society.

[26] M. Schindewolf. Analysis of cache misses using SIMICS. Master's thesis, Institute for Computing Systems Architecture, University of Edinburgh, 2007.

[27] K. Serebryany, D. Bruening, A. Potapenko, and D. Vyukov. AddressSanitizer: A fast address sanity checker. In Proceedings of the 2012 USENIX Annual Technical Conference, USENIX ATC '12, pages 28-28, Berkeley, CA, USA, 2012. USENIX Association.

[28] Q. Zhao, D. Koh, S. Raza, D. Bruening, W.-F. Wong, and S. Amarasinghe. Dynamic cache contention detection in multi-threaded applications. In The International Conference on Virtual Execution Environments, Newport Beach, CA, Mar. 2011.