Cheetah: Detecting False Sharing Efficiently and Effectively

Recommended Citation: Liu, T., & Liu, X. (2016, March). Cheetah: Detecting false sharing efficiently and effectively. In 2016 IEEE/ACM International Symposium on Code Generation and Optimization (CGO) (pp. 1-11). IEEE.

Tongping Liu*
Department of Computer Science, University of Texas at San Antonio, San Antonio, TX 78249, USA
[email protected]

Xu Liu*
Department of Computer Science, College of William and Mary, Williamsburg, VA 23185, USA
[email protected]

Abstract

False sharing is a notorious performance problem that may occur in multithreaded programs when they are running on ubiquitous multicore hardware. It can dramatically degrade the performance by up to an order of magnitude, significantly hurting the scalability. Identifying false sharing in complex programs is challenging. Existing tools either incur significant performance overhead or do not provide adequate information to guide code optimization.

To address these problems, we develop Cheetah, a profiler that detects false sharing both efficiently and effectively. Cheetah leverages the lightweight hardware performance monitoring units (PMUs) that are available in most modern CPU architectures to sample memory accesses. Cheetah develops the first approach to quantify the optimization potential of false sharing instances without actual fixes, based on the latency information collected by PMUs. Cheetah precisely reports false sharing and provides insightful optimization guidance for programmers, while adding less than 7% runtime overhead on average. Cheetah is ready for real deployment.

Categories and Subject Descriptors D.2.8 [Software Engineering]: Metrics–Performance measures; D.1.3 [Programming Techniques]: Concurrent Programming–Parallel Programming

General Terms Measurement, Performance

Keywords Multithreading, False Sharing, Performance Prediction, Address Sampling, Lightweight Profiling

* Both Tongping Liu and Xu Liu are co-first authors.

1. Introduction

Multicore processors are ubiquitous in the computing spectrum: from smart phones, personal desktops, to high-end servers. Multithreading is the de-facto programming model to exploit the massive parallelism of modern multicore architectures. However, multithreaded programs may suffer from various performance issues caused by complex memory hierarchies [21, 23, 29]. Among them, false sharing is a common flaw that can significantly hurt the performance and scalability of multithreaded programs [4]. False sharing occurs when different threads, which are running on different cores with private caches, concurrently access logically independent words in the same cache line. When a thread modifies the data of a cache line, the cache coherence protocol (managed by hardware) automatically invalidates the duplicates of this cache line residing in the private caches of other cores [25]. Thus, even if other threads access completely different words inside this cache line, they have to reload the entire cache line from the shared cache or main memory. Unnecessary cache coherence caused by false sharing can dramatically degrade the performance of multithreaded programs, by up to an order of magnitude [4]. A concrete example shown in Figure 1 also illustrates this catastrophic performance issue. We meant to employ multiple threads to accelerate the computation. However, when eight threads (on an eight-core machine) simultaneously access adjacent elements of an array sharing the same cache line, this program runs ∼13× slower (black bars) than its linear-speedup expectation (grey bars). The hardware trend, such as adding more cores on chip and enlarging the cache line size, will further degrade the performance of multithreaded programs due to false sharing.

Unlike true sharing, false sharing is generally avoidable.
When multiple threads unnecessarily share the same cache line, we can add paddings or utilize thread-private variables so that different threads can be forced to access different cache lines. Although the solution of fixing false sharing problems is somewhat straightforward, finding them is difficult and even impossible with manual checking, especially for a program with thousands or millions of lines of code.

[Figure 1: (a) code of a microbenchmark in which each thread updates a contiguous window of a shared int array (int array[total]; int window = total/numThreads; void threadFunc(int start) { for (index = start; ...) ... }); (b) bar chart of runtime in seconds, Expectation vs. Reality, for increasing thread counts.]
Figure 1. (a) A false sharing example inside a multithreaded program (b) causes 13× performance degradation on an 8-core machine.
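Only fragments of the figure's code survive extraction; the following self-contained reconstruction (constants such as NUM_THREADS, TOTAL, and ITERATIONS are our assumptions) reproduces the same access pattern: every thread repeatedly writes its own window of a shared array, and because the adjacent windows fall into the same cache line, the writes falsely share it.

    #include <pthread.h>
    #include <stdio.h>

    #define NUM_THREADS 8
    #define TOTAL       (NUM_THREADS * 2)   /* 16 ints = one 64-byte cache line */
    #define ITERATIONS  50000000L

    int array[TOTAL];                       /* all windows fall into a single cache line */
    int window = TOTAL / NUM_THREADS;       /* two adjacent ints per thread */

    void *threadFunc(void *arg) {
      long start = (long)arg;
      for (long i = 0; i < ITERATIONS; i++)
        for (long index = start; index < start + window; index++)
          array[index]++;                   /* concurrent writes to neighboring words */
      return NULL;
    }

    int main(void) {
      pthread_t tid[NUM_THREADS];
      for (long i = 0; i < NUM_THREADS; i++)
        pthread_create(&tid[i], NULL, threadFunc, (void *)(i * (long)window));
      for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(tid[i], NULL);
      printf("array[0] = %d\n", array[0]);
      return 0;
    }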

Thus, it is important to employ tools to pinpoint false sharing and provide insightful optimization guidance.

However, existing general-purpose tools do not provide enough details about false sharing [7, 11, 21]. Existing false sharing detectors fall short in several ways. First, most tools cannot distinguish true and false sharing, or require substantial manual effort to identify optimization opportunities [9, 12, 13, 16, 18, 24, 26, 30, 31]. Second, tools [9, 18, 20, 28, 32] based on memory access instrumentation introduce high runtime overhead, hindering their applicability to real, long-running applications. Third, some tools either require OS extensions [24], or only work on special applications [19]. Fourth, no prior tools assess the performance gain from eliminating an identified false sharing bottleneck. Without this information, many optimization efforts may yield trivial or no performance improvement [19, 32].

Cheetah is designed to address all these issues with the following two contributions:

• The First Approach to Predict False Sharing Impact. This paper introduces the first approach to predict the potential performance impact of fixing false sharing instances, without actual fixes. Based on our evaluation, Cheetah can precisely assess the performance improvement, with less than 10% difference. By ruling out trivial instances, we can avoid unnecessary manual effort leading to little or no performance improvement.

• An Efficient and Effective False Sharing Detector. Cheetah is an efficient false sharing detector, with only ∼7% performance overhead. Cheetah utilizes the performance monitoring units (PMUs) that are available in modern CPU architectures to sample memory accesses. Cheetah provides sufficient information on false sharing problems, by pointing out the lines of code, names of variables, and detailed memory accesses involved in the false sharing. Cheetah is a runtime library that is very convenient to deploy; there is no need for a custom OS, nor recompilation or changing of programs.

[Figure 2: the application links against the Cheetah runtime (Data Collection, FS Detection, FS Assessment, FS Report) in user space; a driver in the operating system exposes the PMU hardware.]

Figure 2. Overview of Cheetah's components (shadow blocks), where "FS" is the abbreviation for false sharing.

Figure 2 shows the overview of Cheetah. The "data collection" module gleans memory accesses via the address sampling supported by the hardware performance monitoring units (PMUs) and, with the assistance of the "driver" module, filters out memory accesses associated with heap or globals, feeding them into the "FS detection" module. At the end of an application, or when Cheetah is requested to report, the "FS detection" module examines the number of cache invalidations of each cache line, differentiates false sharing from true sharing, and passes false sharing instances to the "FS assessment" module. In the end, the "FS report" module only reports false sharing instances with a significant performance impact, along with the predicted performance improvement after fixes provided by the "FS assessment" module.

The remainder of this paper is organized as follows. Section 2 describes the false sharing detection components of Cheetah. Section 3 discusses how Cheetah assesses the performance impact of a false sharing instance. Section 4 presents experimental results, including effectiveness, performance overhead, and assessment precision. Section 5 addresses main concerns about hardware dependence, performance, and effectiveness. Lastly, Section 6 discusses related work, and Section 7 concludes this paper.

2. Detecting False Sharing

Cheetah reports false sharing problems with a significant performance impact, where they will incur a large number of cache invalidations on corresponding cache lines. However, tracking cache invalidations turns out to be difficult because invalidations depend on the memory access pattern, cache hierarchy, and thread-to-core mappings. To address this challenge, Cheetah proposes a simple rule to compute cache invalidations: when a thread writes a cache line that has been accessed by other threads recently, this write access incurs a cache invalidation. This rule is based on the following two assumptions, which are introduced in prior work [20, 32].

• Assumption 1: Each thread runs on a separate core with its own private cache.

• Assumption 2: Cache sizes are infinite.

Assumption 1 is reasonable because the over-subscription of threads is generally rare for computation-intensive programs. But we may overreport the number of cache invalidations in the following situations: (1) if multiple threads are actually scheduled to the same physical core, or (2) different cores may share part of the cache hierarchy (instead of having private caches), or (3) hyper-threading technology is enabled such that multiple threads may share the same CPU. However, we argue that the problem of overreporting can actually cancel out the weakness of using the sampling technique. This assumption avoids the tracking of thread-to-core mapping, as well as knowledge of the actual cache hierarchy.

Assumption 2 further defines the behavior related to cache eviction. In reality, a cache entry will be evicted when the cache is not sufficient to hold all active data of an application. Without this assumption, we should track every memory access and simulate the cache eviction in order to determine the accurate number of cache invalidations, which is notoriously slow and suffers from limited precision [28]. As with the first assumption, Assumption 2 may also overestimate the number of cache invalidations. Based on this assumption, if there is a memory access within a cache line, the hardware cache for a running thread always holds the data until an access issued by other threads (running on other cores, by Assumption 1) invalidates it. By combining these two assumptions, Cheetah identifies cache invalidations simply based on memory access patterns, independent of the underlying architecture and specific execution conditions.

In the remainder of this section, we elaborate on how we track memory accesses in Section 2.1, how we locate a sampled access's cache line in Section 2.2, how we compute cache invalidations based on sampled memory accesses in Section 2.3, and how we report false sharing in Section 2.4.

2.1 Sampling Memory Accesses

According to the basic rule described above, it is important to track memory accesses in order to compute the number of cache invalidations that occur on each cache line. Software-based approaches may introduce higher than 5× performance overhead [20, 32]. High overhead can block people from using these tools in real deployment.

Cheetah significantly reduces the performance overhead by leveraging PMU-based sampling mechanisms that are pervasively available in modern CPU architectures, such as AMD instruction-based sampling (IBS) [6] and Intel precise event-based sampling (PEBS) [14]. For each sample, the PMU distinguishes whether it is a memory read or write, captures the memory address, and records the thread ID that triggered the sample. These raw data will be analyzed to determine whether the sampled access incurs a cache invalidation or not, based on the rules described in Section 2. Since PMUs only sample one memory access out of a predefined number of accesses, this approach greatly reduces the runtime overhead incurred by performance data collection and analysis. Moreover, using PMU-based sampling does not need to instrument source or binary code explicitly, thus providing a non-intrusive way to monitor memory references. Cheetah shows that PMU-based sampling, even with sparse samples (e.g., one out of 64K instructions), can identify false sharing with a significant performance impact.

Implementation. In order to sample memory accesses, Cheetah programs the PMU registers to turn on sampling before the main routine. It also installs a signal handler to collect detailed memory accesses. Cheetah configures the signal handler to respond to the current thread, by calling the fcntl function with the F_SETOWN_EX flag. This method avoids any lookup overhead and simplifies signal handling. Inside the signal handler, Cheetah collects detailed information for every sampled memory access, including its memory address, thread ID, read or write operation, and access latency, which can be fed into the "FS detection" module to compute the number of cache invalidations, as well as the "FS assessment" module to predict the performance impact.

2.2 Locating Problematic Cache Lines

For each sampled memory access, Cheetah decides whether this access causes a cache invalidation or not, and records this access. For this purpose, Cheetah should quickly locate its corresponding cache line. Cheetah utilizes the shadow memory mechanism to speed up this locating procedure [20, 32]. To utilize the shadow memory mechanism, we should determine the range of heap memory, which is difficult to know beforehand when using the default heap. Thus, Cheetah builds its custom heap based on Heap Layers [2]. Cheetah pre-allocates a fixed-size memory block (using mmap), and satisfies all memory allocations from this block. Cheetah adapts the per-thread heap organization used by Hoard, so that two objects in the same cache line will never be allocated to two different threads [1]. This design prevents inter-object false sharing, but also makes Cheetah unable to report problems that are possibly caused by the default heap allocator.
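To make the shadow-memory lookup concrete, the sketch below shows how a sampled address inside the pre-allocated heap block can be mapped to its per-cache-line record with a subtraction and a bit shift. The names, the record layout, and the 64-byte line size are our assumptions, not Cheetah's actual code.

    #include <stddef.h>
    #include <stdint.h>
    #include <sys/mman.h>

    #define CACHE_LINE_BITS 6                 /* assume 64-byte cache lines */

    /* Hypothetical per-cache-line record kept in shadow memory. */
    typedef struct {
      uint32_t num_writes;                    /* coarse filter before detailed tracking */
      /* ... per-word read/write counters would be kept in a second array ... */
    } line_info_t;

    static char        *heap_start;           /* start of the pre-allocated heap block */
    static line_info_t *line_info;            /* one record per cache line */

    /* Reserve the heap block and its shadow array, both with mmap. */
    static int shadow_init(size_t heap_size) {
      size_t lines = heap_size >> CACHE_LINE_BITS;
      heap_start = mmap(NULL, heap_size, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      line_info  = mmap(NULL, lines * sizeof(line_info_t), PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      return (heap_start == MAP_FAILED || line_info == MAP_FAILED) ? -1 : 0;
    }

    /* Map a sampled heap address to its cache-line record by bit shifting. */
    static inline line_info_t *lookup_line(void *addr) {
      uintptr_t offset = (uintptr_t)addr - (uintptr_t)heap_start;
      return &line_info[offset >> CACHE_LINE_BITS];
    }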

Implementation. To use its custom heap, Cheetah intercepts all memory allocations and deallocations. Cheetah initializes the heap before an application enters the main routine. Cheetah maintains two different heaps: one for the application itself, as well as one for internal use. For both heaps, Cheetah manages objects based on the unit of power of two. For each memory allocation from the application, Cheetah saves the information of callsite and size, which helps Cheetah to precisely report the line information of falsely-shared objects. Cheetah allocates two large arrays (using mmap) to track the number of writes and detailed access information on each cache line. For each memory access, Cheetah uses bit-shifting to compute the index of its cache line.

2.3 Computing Cache Invalidations

Prior work of Zhao et al. proposes an ownership-based method to compute the number of cache invalidations: when a thread updates a cache line owned by others, this access incurs a cache invalidation, and then resets the ownership to the current thread [32]. However, this approach cannot easily scale to more than 32 threads because of excessive memory consumption, since it needs one bit for every thread to track the ownership.

To address this problem, Cheetah maintains a two-entry table (T) for each cache line (L), in which each thread will, at most, occupy one of these two entries. In this table, each entry has two fields: a thread ID and an access type (read or write). It computes the number of invalidations according to the rule described in Section 2. In case of a cache invalidation, the current access (write) is added into the corresponding table. Thus, each table always has at least one entry. More specifically, Cheetah handles each access as follows:

• For each read access, Cheetah decides whether to record this entry. If the table T is not full, and the existing entry is coming from a different thread (with a different ID), Cheetah records this read access in the table. Otherwise, there is no need to handle this read access.

• For each write access, Cheetah decides whether this access incurs an invalidation. If the table is already full, based on Assumption 1, it incurs a cache invalidation, since at least one of the existing entries in this table is from a different thread. If the table is both not full, and not empty, Cheetah checks whether the existing entry is from a different thread or not. If this write access is from the same thread as the existing entry, Cheetah skips the current write access, since there is no need to update the existing entry. In all other cases, this write access incurs at least a cache invalidation. Currently, Cheetah does not differentiate how many reads have occurred prior to this write. In case of a cache invalidation, the table is flushed, and the write access is recorded in the table to maintain the table as not empty.

Implementation. As aforementioned, only cache lines with a large number of writes can possibly have a high impact on performance. Based on this observation, cache lines with a small number of writes are never the cause of the severe performance degradation. For this reason, Cheetah first tracks the number of writes on a cache line, and only tracks detailed information for cache lines with more than two writes. This simple policy avoids tracking detailed information for write-once memory.

2.4 Reporting False Sharing

Cheetah reports false sharing correctly and precisely, either at the end of an execution, or when interrupted by the user.

Correct Detection. Cheetah tracks word-based (four-byte) memory accesses on susceptible cache lines using the shadow memory technique: that is, the amount of reads or writes issued by a particular thread on each word. When more than one thread accesses a word, Cheetah marks this word to be shared by multiple threads. By identifying accesses on each word of a susceptible cache line, we can easily differentiate false sharing from true sharing, since multiple threads will access the same words in true sharing. Word-based information also helps programmers to decide how to pad a problematic data structure during fixing phases. It is very common that the main thread may allocate and initialize objects before they are accessed by multiple child threads. Prior work, including Predator [20], may wrongly report them as true sharing instances. Cheetah avoids this problem by only recording detailed accesses inside parallel phases.

Precise Detection. Cheetah reports precise information for global variables and heap objects that are involved in false sharing. For global variables, Cheetah reports names and addresses by searching through the symbol table in the binary executable. For heap objects, Cheetah reports the lines of code corresponding to their allocation sites. Thus, Cheetah intercepts all memory allocations and de-allocations to obtain the entire call stack. Cheetah does not monitor stack variables because they are normally accessed only by their hosting threads. It is noted that the default backtrace function in glibc is extremely slow due to expensive instruction analysis. Cheetah utilizes the frame pointers to fetch the call stack efficiently. Moreover, we only collect five function entries on the call stack for performance reasons.
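Putting the rules of Section 2.3 together, the following sketch records a sampled access in a cache line's two-entry table and reports whether it counts as an invalidation under Assumptions 1 and 2. The data layout and names are ours; Cheetah's implementation may differ.

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
      int32_t tid;        /* thread that made the access; -1 marks an empty slot */
      bool    is_write;   /* access type */
    } entry_t;

    typedef struct {
      entry_t slot[2];    /* at most one slot per thread; both tids start at -1 */
    } line_table_t;

    /* Apply the rule from Section 2: a write to a line recently touched by another
       thread counts as one invalidation. Returns true if this access invalidates. */
    static bool record_access(line_table_t *t, int32_t tid, bool is_write) {
      bool slot0_used = (t->slot[0].tid != -1);
      bool slot1_used = (t->slot[1].tid != -1);

      if (!is_write) {                                   /* read access */
        if (slot0_used && !slot1_used && t->slot[0].tid != tid) {
          t->slot[1].tid = tid;                          /* remember a second thread's read */
          t->slot[1].is_write = false;
        } else if (!slot0_used) {
          t->slot[0].tid = tid;                          /* first access ever seen */
          t->slot[0].is_write = false;
        }
        return false;                                    /* reads never invalidate */
      }

      /* Write access: a full table, or a non-empty table owned by another thread,
         means some other thread touched this line recently. */
      bool invalidates = (slot0_used && slot1_used) ||
                         (slot0_used && t->slot[0].tid != tid);
      if (invalidates || !slot0_used) {
        t->slot[0].tid = tid;                            /* flush, keep only this write */
        t->slot[0].is_write = true;
        t->slot[1].tid = -1;
      }
      return invalidates;
    }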

3. Assessing the Performance Impact

Fixing false sharing does not necessarily yield significant performance speedups, even for instances with a large number of cache invalidations [19, 20]. Zhao et al. even observe that fixing false sharing may slow down a program, since padding data structures may increase its memory footprint or lose cache locality [32]. Thus, it is very important to rule out insignificant false sharing instances, which are not false positives, yet reporting them increases the manual burden for fixes.

Cheetah makes the first attempt to quantitatively assess the potential performance gain of fixing a false sharing instance based on the results of an execution. We agree that different executions may vary on the specific details, but this should not change the overall prediction result: whether a false sharing instance is significant or not. Actually, the evaluation described in Section 4.3 confirms that our predicted performance has less than a 10% difference from that of the actual fixes. Based on this prediction, programmers can focus on severe problems only, avoiding unnecessary manual effort spent on insignificant cases.

Cheetah's assessment is based on the following observations:

• Observation 1: Samples are evenly distributed over the whole execution. Based on this, we can use the sampling technique to represent the whole execution. A similar idea has been widely used by prior work, such as Oprofile [17] and Gprof [8], to identify the hotspots of function calls.

• Observation 2: The PMU provides the latency information (e.g., cycles) of each memory access; the latency of memory accesses with false sharing is significantly higher than that of other accesses.

Based on these two observations, we propose to use the sampled cycles to represent the whole execution, and further predict the performance impact of falsely-shared objects by replacing these cycles with the average cycles of memory accesses without false sharing. The assessment is performed in three steps, listed as follows:

• Cheetah first predicts the possible cycles after fixes by replacing actual cycles with the average cycles of memory accesses without false sharing, as discussed in Section 3.1.

• Then, Cheetah assesses the performance impact of fixes on the related threads, which is discussed in Section 3.2.

• In the end, Cheetah assesses the performance impact on the application in Section 3.3.

The remainder of this section discusses the detailed assessment step-by-step. For reasons of simplicity, we abbreviate the falsely-shared object as "O", the related thread as "t", the prediction as "Pred", the runtime as "RT", and the application as "App".

3.1 Impact on Accesses to the Object

At first, Cheetah predicts the possible cycles of accesses after fixing false sharing of the object O. Cheetah tracks the number of cycles and accesses on each word, thus it is convenient to compute the total cycles of accesses (Cycles_O) and the total number of accesses (Accesses_O) on a specific object O. However, it is impossible to know the average cycles of every access after fixing (AverCycles_nofs) without actual fixes; Cheetah utilizes the average cycles in serial phases (AverCycles_serial) to approximate this value. There is no false sharing in serial phases, and AverCycles_serial represents the least number of cycles for memory accesses after fixes. If Cheetah does not track any accesses in serial phases, a default value learned from experience will be utilized as AverCycles_serial. In reality, AverCycles_nofs can be larger than AverCycles_serial, since fixing false sharing may lead to excessive memory consumption or the loss of locality [32]. Cheetah actually predicts the best performance of fixing a false sharing instance.

Cheetah computes the total cycles of accesses after fixes (PredCycles_O) based on EQ. (1). It is expected that PredCycles_O will be less than the total cycles before fixing (Cycles_O), since fixing a false sharing problem will reduce the execution time and cycles.

    PredCycles_O = AverCycles_nofs × Accesses_O    (1)

3.2 Impact on Related Threads

The second step is to assess how reducing the access cycles of O (PredCycles_O) can potentially affect the execution time of its related threads.

Cheetah collects the following runtime information of every thread: the execution time (RT_t), the total number of accesses (Accesses_t), and the total cycles of all memory accesses (Cycles_t). In order to avoid any lookup overhead, Cheetah lets every thread handle the sample events of the current thread, and records the corresponding number of accesses and cycles. To collect RT_t, Cheetah intercepts the creation of threads by passing a custom function as the start routine. Cheetah acquires the timestamp before and after the execution of a thread using RDTSC (Read Time-Stamp Counter) [10], and regards the difference to be the execution time of a particular thread. In the current implementation, we do not take into account the waiting time of different threads caused by synchronizations; we leave this for future work.

After the collection, Cheetah predicts the cycles of every related thread (PredCycles_t) after fixes as EQ. (2).

    PredCycles_t = Cycles_t − Cycles_O + PredCycles_O    (2)

Based on PredCycles_t, Cheetah assesses the predicted runtime of a thread (PredRT_t) as EQ. (3). We assume that the execution time is proportional to the cycles of accesses, such that fewer cycles indicate less execution time, and correspond to a performance speedup. It is expected that fixing the false sharing problem inside the object O will improve the performance of its related threads, with less PredCycles_t and PredRT_t.

    PredRT_t = (PredCycles_t / Cycles_t) × RT_t    (3)
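EQ. (1) through EQ. (3) amount to a few arithmetic operations per thread and object; a minimal sketch, with hypothetical structure and field names, could look as follows.

    #include <stdint.h>

    /* Per-thread statistics collected from PMU samples (hypothetical layout). */
    typedef struct {
      double   runtime;           /* RT_t, measured with RDTSC */
      uint64_t cycles;            /* Cycles_t: total sampled access cycles of the thread */
    } thread_stats_t;

    /* Statistics for one falsely-shared object within one thread. */
    typedef struct {
      uint64_t cycles;            /* Cycles_O: sampled cycles spent on the object */
      uint64_t accesses;          /* Accesses_O: number of sampled accesses to the object */
    } object_stats_t;

    /* Predict a thread's runtime after fixing the object, per EQ. (1)-(3). */
    static double predict_thread_runtime(const thread_stats_t *t,
                                         const object_stats_t *o,
                                         double avg_cycles_nofs /* AverCycles_serial */) {
      double pred_cycles_o = avg_cycles_nofs * (double)o->accesses;          /* EQ. (1) */
      double pred_cycles_t = (double)t->cycles - (double)o->cycles
                             + pred_cycles_o;                                /* EQ. (2) */
      return (pred_cycles_t / (double)t->cycles) * t->runtime;               /* EQ. (3) */
    }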

3.3 Impact on the Application

In the end, Cheetah assesses how fixes will change the performance of the application.

Actually, improving the performance of a thread may not increase the final performance of an application if this thread is not in the critical path. To simplify the prediction, as well as verify our idea, Cheetah currently focuses on applications with the normal fork-join model shown in Figure 3. This model is the most important and widely-used model in actuality. All applications that we have evaluated in this paper utilize this fork-join model. The performance assessment will be more complicated if nested threads are utilized inside an application.

Cheetah tracks the creations and joins of threads in order to verify whether an application belongs to the fork-join model or not. Cheetah also collects the execution time of different serial and parallel phases, using RDTSC (Read Time-Stamp Counter), available on x86 machines [10]. In the fork-join model, shown in Figure 3, an application leaves a serial phase after the creation of a thread; it leaves a parallel phase after all child threads (created in the current phase) have been successfully joined.

Based on the runtime information of every parallel and serial phase, Cheetah assesses the final performance impact by recomputing the length of each phase, and the total time after fixing. The length of each phase is decided by the thread with the longest execution time, while the total time of an application is equal to the sum of the different parallel and serial phases.

After Cheetah computes the possible execution time of an application after fixing a false sharing problem, Cheetah will compute and report the potential performance improvement of every falsely-shared object, based on EQ. (4). Then, programmers can focus on those with the most serious performance impact. We further verify the precision of Cheetah's assessment in Section 4.3.

    PerfImprove = RT_App / PredRT_App    (4)

4. Evaluation

The evaluation answers the following research questions:

• What is the performance overhead of Cheetah? (4.1)

• How effectively can Cheetah detect false sharing problems? How helpful are the outputs in fixing false sharing problems? (4.2)

• What is the precision of assessment on each false sharing instance? (4.3)

Experimental Setup. We evaluate Cheetah on an AMD Opteron machine, which has 48 1.6 GHz cores, 64 KB private L1 data cache, 512 KB private L2 cache, 10 MB shared L3 cache, and 128 GB memory. We use gcc-4.6 with the -O2 option to compile all applications. Because the machine is a NUMA machine, and the performance may vary with different scheduling policies, we bind the threads to cores in order to acquire consistent performance.

Evaluated Applications. As in prior work [16, 19, 20, 32], we perform experiments on two well-known benchmark suites: Phoenix [27] and PARSEC [3]. We intentionally use 16 threads in order to run applications for a sufficiently long time, as Cheetah needs enough samples to detect false sharing problems. Basically, we want to make every application run for at least 5 seconds in order to collect enough samples. For the PARSEC benchmarks, we utilize the native input. For some applications of Phoenix, such as linear_regression, we explicitly change the source code by adding more loop iterations.

4.1 Runtime Overhead

We show the average runtime overhead of Cheetah in Figure 4. We run each application five times and show the average results here. These results are normalized to the execution time of pthreads, which means that an application is running slower if the value is higher. According to this figure, Cheetah only introduces around 7% performance overhead, which makes it practical to be utilized for real deployment.

During the evaluation, we configure Cheetah with a sampling frequency of one out of 64K instructions. Thus, for every 64K instructions, the trap handler is notified once so that Cheetah can collect the information relating to memory accesses on heap and global variables. Currently, Cheetah filters out those memory accesses in the kernel, libraries, or others.

The performance overhead of Cheetah mainly comes from the handling of each sampled memory access and each thread creation. For each sampled access, we collect information, such as the type of access (read or write) and the number of cycles, and then update the history table of its corresponding cache line. Cheetah also intercepts every thread creation in order to set up the PMU unit, get the timestamp, and update the phase information. For applications with a large number of threads, including kmeans (with 224 threads in 14 seconds) and x264 (with 1024 threads in 40 seconds), setting PMU registers introduces non-negligible overhead, since it invokes six pfmon calls and six additional system calls. For other applications, Cheetah introduces less than 12% performance overhead, with 4% overhead on average if these two applications are excluded.

4.2 Effectiveness

Cheetah successfully detects two known false sharing problems with significant performance impact, including linear_regression in Phoenix and streamcluster in PARSEC.

[Figure 3: timeline of the fork-join model, alternating serial phases (I, II, III) with parallel phases (I, II) executed by threads T1–T5.]

Figure 3. The fork-join model, currently supported by Cheetah to assess the performance impact of false sharing instances.
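As a rough illustration of the phase tracking behind Figure 3 (Section 3.3), the sketch below wraps thread creation and join so that serial and parallel phase boundaries are timestamped with RDTSC. The wrapper names and the assumption that all creations and joins happen in the main thread are ours; Cheetah's interposition mechanism may differ.

    #include <pthread.h>
    #include <stdint.h>
    #include <x86intrin.h>     /* __rdtsc() */

    static uint64_t phase_start;      /* timestamp of the current phase boundary */
    static int      live_children;    /* threads created in the current parallel phase */

    /* Used in place of pthread_create: a serial phase ends at the first creation. */
    int phase_create(pthread_t *tid, const pthread_attr_t *attr,
                     void *(*fn)(void *), void *arg) {
      if (live_children++ == 0) {
        uint64_t now = __rdtsc();
        /* record_serial_phase(now - phase_start);  (bookkeeping elided) */
        phase_start = now;
      }
      return pthread_create(tid, attr, fn, arg);
    }

    /* Used in place of pthread_join: a parallel phase ends when all children joined. */
    int phase_join(pthread_t tid, void **retval) {
      int rc = pthread_join(tid, retval);
      if (--live_children == 0) {
        uint64_t now = __rdtsc();
        /* record_parallel_phase(now - phase_start);  (bookkeeping elided) */
        phase_start = now;
      }
      return rc;
    }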

[Figure 4: bar chart of normalized runtime (0 to 1.4) for each evaluated Phoenix and PARSEC benchmark and their average, comparing native pthreads execution against execution under Cheetah.]

Figure 4. Runtime overhead of Cheetah. We normalize the runtime to that of native execution without monitoring by Cheetah. On average, Cheetah only introduces around 7% performance overhead on all evaluated applications, which makes its use practical for real deployment.

4.2.1 Case Study: linear_regression

Figure 5 shows the output of Cheetah. It points out that the tid_args object allocated at line 139, with the structure type lreg_args, incurs a severe false sharing problem. According to the assessment, fixing it can possibly improve the performance by 5.7×. By examining the source code, we can discover that the tid_args object is passed to different threads (linear_regression_pthread). Then, we can easily find out where false sharing has been exercised, which is shown in Figure 6. By checking word-based accesses that are reported by Cheetah, but not shown here, we can understand the reason for this false sharing problem: different threads are updating different parts of the object tid_args simultaneously, where each thread updates words with the size of the structure lreg_args. This problem is similar to the example shown in Figure 1.

To address the problem, we pad the structure lreg_args with extra bytes. By adding 64 bytes of useless content, we can force different threads to not access the same cache line. This one-line code change leads to a 5.7× speedup of the performance, which matches the assessment of 5.76× improvement predicted by Cheetah.

    Detecting false sharing at the object: start 0x400004b8
    end 0x400044b8 (with size 4000).
    Accesses 1263 invalidations 27f writes 501 total
    latency 102988 cycles.

    Latency information:
    totalThreads 16
    totalThreadsAccesses 12e1
    totalThreadsCycles 106389
    totalPossibleImprovementRate 576.172748%
    (realRuntime 7738 predictedRuntime 1343).

    It is a heap object with the following callsite:
    linear_regression-pthread.c: 139

Figure 5. Cheetah reports a false sharing problem in linear_regression.
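The one-line padding fix described above can be sketched as follows; the pad field name is our choice, and the elided fields correspond to the lreg_args definition shown in Figure 6 below.

    /* linear_regression-pthread.c (sketch): pad each thread's argument block so
       that two threads' blocks can never share a 64-byte cache line. */
    typedef struct
    {
        /* ... original fields and accumulators (SX, SY, SXX, ...) ... */
        long long SX;
        long long SY;
        long long SXX;
        /* ... */
        char pad[64];        /* one-line change: 64 bytes of padding per element */
    } lreg_args;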

    typedef struct
    {
        ......
        long long SX;
        long long SY;
        long long SXX;
        ......
    } lreg_args;

    for (i = 0; i < args->num_elems; i++)
    {
        // Compute SX, SY, SYY, SXX, SXY
        args->SX  += args->points[i].x;
        args->SXX += args->points[i].x * args->points[i].x;
        args->SY  += args->points[i].y;
        ......
    }

Figure 6. The data structure and source code related to a serious false sharing instance in linear_regression.

4.2.2 Case Study: streamcluster

We do not show the report results for streamcluster due to limitations of space. For streamcluster, every thread will update the work_mem object concurrently, allocated at line 985 of the streamcluster.cpp file. The authors have already added some padding to avoid false sharing. However, they assume the size of the cache line (using a macro) to be 32 bytes, which is smaller than the size of the actual cache line used in our experimental machine. Thus, streamcluster will continue to have a significant false sharing problem. The performance impact of fixing false sharing problems inside streamcluster is further discussed in Section 4.3.

4.2.3 Comparing with State-of-the-art

Predator is the state-of-the-art in false sharing detection, which detects the highest number of instances, but with approximately 6× performance overhead [20]. Compared to Predator, Cheetah misses false sharing problems in histogram, reverse_index and word_count [20].

As discussed before, Cheetah only detects actual false sharing problems that may have a significant impact on final performance. If the number of accesses on a falsely-shared object is not large enough, Cheetah may not be able to detect it due to its sampling feature. Additionally, occurrences of false sharing can be affected by the starting address of objects, the size of the cache line, or by the cache hierarchy, as observed by Predator [20]. Thus, we further check the seriousness of these problems based on Predator's detection results.

We run these applications on our experimental hardware, with and without false sharing problems. Figure 7 shows the performance impact. Actually, these benchmarks do not show a significant speedup after fixing, with less than 0.2% performance improvement. This behavior actually exemplifies the advantage of Cheetah: because Cheetah only reports false sharing problems with significant performance impact, it can potentially save programmers manual effort unnecessarily spent on applications capable of only negligible performance improvement.

[Figure 7: bar chart of normalized runtime (0.998 to 1.001) for histogram, reverse_index, and word_count, with and without false sharing.]

Figure 7. False sharing problems missed by Cheetah have negligible (<0.2%) performance impact.

4.3 Assessment Precision

Cheetah is the first tool that can assess the performance impact of false sharing problems. Based on this information, programmers may save huge amounts of manual effort spent unnecessarily on applications with trivial false sharing problems.

We evaluate the precision of assessment on two applications that are reported to have false sharing problems: linear_regression and streamcluster. We list the precision results in Table 1. In this table, linear_regression is abbreviated as "linear_reg". We evaluate these applications when the number of threads is equal to 16, 8, 4, and 2, correspondingly. We list the predicted performance impact in the "Predict" column, and the actual improvement in the "Real" column of the table. The last column ("Diff") of this table lists the difference between the predicted improvement and the real improvement. If the number is larger than 0, the predicted performance improvement is less than the real improvement. Otherwise, it is the opposite.

Table 1 shows that Cheetah can perfectly assess the performance impact of false sharing in every case, with less than 10% difference for every evaluated execution.

Table 1. Precision of assessment.

    Application     Threads (#)   Predict   Real     Diff (%)
    linear_reg      16            6.44X     6.7X     -3.8
    linear_reg      8             5.56X     5.4X     +3.0
    linear_reg      4             3.86X     4.1X     -5.8
    linear_reg      2             2.18X     2X       +9
    streamcluster   16            1.016X    1.015X   0
    streamcluster   8             1.017X    1.018X   0
    streamcluster   4             1.024X    1.022X   0
    streamcluster   2             1.033X    1.035X   0

5. Discussion

This section addresses some possible concerns related to Cheetah.

Hardware Dependence. Cheetah is an approach that relies on hardware PMUs to sample memory accesses. To use Cheetah, users should set up the driver to enable the PMU-based sampling beforehand. Afterward, they can connect to the Cheetah library by calling only two APIs: one API is to set up PMU-based registers, while the other handles every sampled memory access, with less than 5 lines of code change.

Performance Overhead. On average, Cheetah only introduces 7% performance overhead for all evaluated applications. However, Cheetah does introduce more than 20% overhead for two applications that have a large number of threads, because Cheetah should monitor thread creations and set up hardware registers for every thread. However, this should not be of large concern when using Cheetah. First, the creation of a large number of threads in an application is atypical. Secondly, we expect this overhead could be further reduced with improved hardware support.

Effectiveness. Cheetah effectively detects false sharing problems that occur in the current execution and have a high impact on performance. For effective detection, Cheetah requires programs to run sufficiently long, perhaps more than a few seconds. This should not be a problem for long-running applications, which are the primary targets of optimizations.

6. Related Work

In this section, we review existing tools for the detection of false sharing issues, as well as other techniques of utilizing PMUs for dynamic analysis.

Cheetah is the first work to predict the potential performance improvement after fixing false sharing problems, relying on access latency information provided by the PMU hardware. Existing work utilizes the latency information to identify variables and instructions suffering from high access latency [21, 22].

6.1 False Sharing Detection

Existing tools dealing with false sharing detection can be classified into different types, based on the method of collecting memory accesses or cache-related events.

Simulation-Based Approaches. Simulation-based approaches simulate the behavior of program executions and may report possible false sharing problems within programs [28]. Simulation-based approaches generally introduce more than 100× performance overhead and cannot simulate large applications.

Instrumentation-Based Approaches. Tools in this category can be further divided into dynamic instrumentation-based approaches [9, 18, 32], and static or compiler-based approaches [20]. These approaches generally have large performance overhead, running from 5× slower [20, 32] to 100× slower [9, 18]. Predator is the state-of-the-art tool in false sharing detection, and uncovers the largest number of false sharing instances [20]. However, its performance overhead is still too high to be used in deployed software. Cheetah utilizes a different approach than Predator, and successfully reduces the detection overhead.

OS-Related Approaches. Sheriff proposes turning threads into processes, and relies on page-based protection and isolation to capture memory writes; it reports write-write false sharing problems with reasonable overhead (around 20%) [19]. Plastic provides a VMM-based prototype system that combines performance counter monitoring (PMU) with page-granularity analysis (based on page protection) [24]. Both Sheriff and Plastic can automatically tolerate serious false sharing problems. However, these tools have their own shortcomings: Sheriff can only work on programs using pthreads libraries, and without ad-hoc synchronizations or sharing-the-stack accesses; Plastic requires that programs run in a virtual machine environment, which is not applicable for most applications.

PMU-Based Approaches. PMU-based tools are introduced due to performance reasons. All existing PMU-based approaches actually detect false sharing by monitoring cache-related events. Jayasena et al. use a machine learning approach to derive the potential pattern of false sharing by monitoring cache misses, TLB events, interactions among cores, and resource stalls [16]. DARWIN collects cache coherence events during the first round, then identifies possible memory accesses on data structures with frequent cache invalidations during the second round [30]. Intel's PTU relies on the PEBS mechanism to track cache invalidation-related memory accesses, but it cannot differentiate false sharing and true sharing [12]. However, existing PMU-based tools suffer from the following problems: (1) they cannot report all existing false sharing problems due to sampling on very rare cache events (e.g., cache invalidations) [16, 26, 30] (otherwise they experience a large quantity of false positives [12]); (2) they rely on manual annotation [26], experts' expertise [12, 30], or multiple executions to locate the problem [30]; (3) they cannot provide sufficient information for optimization [12, 16, 26, 30].

Cheetah specifically addresses these existing problems, and provides much richer information for optimization, including word-level accesses and potential performance impact after fixes. Also, Cheetah provides better effectiveness than existing PMU-based approaches since it samples much richer events such as memory accesses.

6.2 Other PMU-Related Analysis

PMU-related techniques are widely used to identify other performance problems due to their low (less than 10%) overhead, such as memory system behavior and data locality. Itzkowitz et al. introduce memory profiling to Sun ONE Studio, which can collect and analyze memory accesses in sequential programs, and report measurement data related to annotated code segments [15]. Buck and Hollingsworth develop Cache Scope to perform data-centric analysis using Itanium 2 event address registers (EAR) [5]. HPCToolkit [21] and ArrayTool [22] use AMD IBS to associate memory access latency with both static and heap-allocated data objects, and further provide optimization guidance for array regrouping. However, these general-purpose tools can only, at best, identify data objects suffering from high access latency. They do not determine whether the high latency originates from false sharing, nor can they provide rich information to assist with optimizations. In contrast, Cheetah can correctly identify false sharing problems, as well as associate detailed memory accesses with problematic data objects to help optimizations.

7. Conclusion

This paper presents Cheetah, a lightweight profiler that identifies false sharing in multithreaded programs. Cheetah employs the first approach that quantifies the optimization potential of fixing false sharing instances, without actual fixes. For detection, Cheetah distinguishes true and false sharing, and only reports problems that significantly impact overall program performance. Cheetah provides insightful guidance for fixing problems while only introducing 7% runtime overhead, making it ready for real deployment.

Acknowledgements

This material is based upon work supported by startup packages provided by the University of Texas at San Antonio and the College of William and Mary, and a Google Faculty Research Award. This research was also supported by the National Science Foundation (NSF) under Grant No. 1464157. We would like to thank Sam Silvestro, Jinpeng Zhou, Hongyu Liu, and Jin Han for their invaluable comments and suggestions that helped improve this paper. Finally, this paper integrates the implementation of Guangming Zeng on acquiring callstacks efficiently.

References

[1] E. D. Berger, K. S. McKinley, R. D. Blumofe, and P. R. Wilson. Hoard: a scalable memory allocator for multithreaded applications. In ASPLOS-IX: Proceedings of the Ninth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 117–128, New York, NY, USA, 2000. ACM Press.

[2] E. D. Berger, B. G. Zorn, and K. S. McKinley. Composing high-performance memory allocators. In Proceedings of the ACM SIGPLAN 2001 Conference on Programming Language Design and Implementation, PLDI '01, pages 114–124, New York, NY, USA, 2001. ACM.

[3] C. Bienia. Benchmarking Modern Multiprocessors. PhD thesis, Princeton University, January 2011.

[4] W. J. Bolosky and M. L. Scott. False sharing and its effect on shared memory performance. In SEDMS IV: USENIX Symposium on Experiences with Distributed and Multiprocessor Systems, pages 57–71, Berkeley, CA, USA, 1993. USENIX Association.

[5] B. R. Buck and J. K. Hollingsworth. Data centric cache measurement on the Intel Itanium 2 processor. In SC '04: Proc. of the 2004 ACM/IEEE Conf. on Supercomputing, page 58, Washington, DC, USA, 2004. IEEE Computer Society.

[6] P. J. Drongowski. Instruction-based sampling: A new performance analysis technique for AMD family 10h processors. http://developer.amd.com/Assets/AMD_IBS_paper_EN.pdf, November 2007. Last accessed: Dec. 13, 2013.

[7] Gprof community. GNU gprof. https://sourceware.org/binutils/docs/gprof/.

[8] S. L. Graham, P. B. Kessler, and M. K. McKusick. gprof: a call graph execution profiler. In SIGPLAN Symposium on Compiler Construction, pages 120–126, 1982.

[9] S. M. Günther and J. Weidendorfer. Assessing cache false sharing effects by dynamic binary instrumentation. In WBIA '09: Proceedings of the Workshop on Binary Instrumentation and Applications, pages 26–33, New York, NY, USA, 2009. ACM.

[10] Intel. Using the RDTSC instruction for performance monitoring. https://www.ccsl.carleton.ca/~jamuir/rdtscpm1.pdf, 1997.

[11] Intel Corporation. Intel VTune performance analyzer. http://www.intel.com/software/products/vtune.

[12] Intel Corporation. Intel Performance Tuning Utility 3.2 Update, November 2008.

[13] Intel Corporation. Avoiding and identifying false sharing among threads. http://software.intel.com/en-us/articles/avoiding-and-identifying-false-sharing-among-threads/, February 2010.

[14] Intel Corporation. Intel 64 and IA-32 architectures software developer's manual, Volume 3B: System programming guide, Part 2, Number 253669-032, June 2010.

[15] M. Itzkowitz, B. J. N. Wylie, C. Aoki, and N. Kosche. Memory profiling using hardware counters. In SC '03: Proc. of the 2003 ACM/IEEE Conf. on Supercomputing, page 17, Washington, DC, USA, 2003. IEEE Computer Society.

[16] S. Jayasena, S. Amarasinghe, A. Abeyweera, G. Amarasinghe, H. De Silva, S. Rathnayake, X. Meng, and Y. Liu. Detection of false sharing using machine learning. In Proceedings of SC13: International Conference for High Performance Computing, Networking, Storage and Analysis, SC '13, pages 30:1–30:9, New York, NY, USA, 2013. ACM.

[17] J. Levon and P. Elie. Oprofile: A system profiler for Linux, 2004.

[18] C.-L. Liu. False sharing analysis for multithreaded programs. Master's thesis, National Chung Cheng University, July 2009.

[19] T. Liu and E. D. Berger. Sheriff: precise detection and automatic mitigation of false sharing. In Proceedings of the 2011 ACM International Conference on Object Oriented Programming Systems Languages and Applications, OOPSLA '11, pages 3–18, New York, NY, USA, 2011. ACM.

[20] T. Liu, C. Tian, H. Ziang, and E. D. Berger. Predator: Predictive false sharing detection. In Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '14, New York, NY, USA, 2014. ACM.

[21] X. Liu and J. M. Mellor-Crummey. A data-centric profiler for parallel programs. In Proc. of the 2013 ACM/IEEE Conference on Supercomputing, Denver, CO, USA, 2013.

[22] X. Liu, K. Sharma, and J. Mellor-Crummey. ArrayTool: A lightweight profiler to guide array regrouping. In Proceedings of the 23rd International Conference on Parallel Architectures and Compilation, PACT '14, pages 405–416, New York, NY, USA, 2014. ACM.

[23] X. Liu and B. Wu. ScaAnalyzer: A tool to identify memory scalability bottlenecks in parallel programs. In Proc. of the 2015 ACM/IEEE Conference on Supercomputing, Austin, TX, USA, 2015.

[24] M. Nanavati, M. Spear, N. Taylor, S. Rajagopalan, D. T. Meyer, W. Aiello, and A. Warfield. Whose cache line is it anyway?: operating system support for live detection and repair of false sharing. In Proceedings of the 8th ACM European Conference on Computer Systems, EuroSys '13, pages 141–154, New York, NY, USA, 2013. ACM.

[25] M. S. Papamarcos and J. H. Patel. A low-overhead coherence solution for multiprocessors with private cache memories. In Proceedings of the 11th Annual International Symposium on Computer Architecture, ISCA '84, pages 348–354, New York, NY, USA, 1984. ACM.

[26] A. Pesterev, N. Zeldovich, and R. T. Morris. Locating cache performance bottlenecks using data profiling. In EuroSys '10: Proceedings of the 5th European Conference on Computer Systems, pages 335–348, New York, NY, USA, 2010. ACM.

[27] C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, and C. Kozyrakis. Evaluating MapReduce for multi-core and multiprocessor systems. In HPCA '07: Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture, pages 13–24, Washington, DC, USA, 2007. IEEE Computer Society.

[28] M. Schindewolf. Analysis of cache misses using SIMICS. Master's thesis, Institute for Computing Systems Architecture, University of Edinburgh, 2007.

[29] W. Wang, T. Dey, J. Davidson, and M. Soffa. DraMon: Predicting memory bandwidth usage of multi-threaded programs with high accuracy and low overhead. In High Performance Computer Architecture (HPCA), 2014 IEEE 20th International Symposium on, pages 380–391, Feb 2014.

[30] B. Wicaksono, M. Tolubaeva, and B. Chapman. Detecting false sharing in OpenMP applications using the DARWIN framework. In Proceedings of the International Workshop on Languages and Compilers for Parallel Computing, 2011.

[31] B. Wicaksono, M. Tolubaeva, and B. Chapman. Detecting false sharing in OpenMP applications using the DARWIN framework. In S. Rajopadhye and M. Mills Strout, editors, Languages and Compilers for Parallel Computing, volume 7146 of Lecture Notes in Computer Science, pages 283–297. Springer Berlin Heidelberg, 2013.

[32] Q. Zhao, D. Koh, S. Raza, D. Bruening, W.-F. Wong, and S. Amarasinghe. Dynamic cache contention detection in multi-threaded applications. In The International Conference on Virtual Execution Environments, Newport Beach, CA, Mar 2011.
