Improving Multiprocessor Performance with Coarse-Grain Coherence Tracking

Improving Multiprocessor Performance with Coarse-Grain Coherence Tracking Jason F. Cantin, Mikko H. Lipasti, and James E. Smith Department of Electrical and Computer Engineering University of Wisconsin, Madison {jcantin, mikko, jes}@ece.wisc.edu Abstract caused by broadcasts, high performance multiprocessor To maintain coherence in conventional shared-memory systems decouple the coherence mechanism from the data multiprocessor systems, processors first check other proc- transfer mechanism, allowing data to be moved directly essors’ caches before obtaining data from memory. This from a memory controller to a processor either over a coherence checking adds latency to memory requests and separate data network [1, 2, 3], or separate virtual chan- leads to large amounts of interconnect traffic in broadcast- nels [4]. This approach to dividing data transfer from based systems. Our results for a set of commercial, scien- coherence enforcement has significant performance poten- tific and multiprogrammed workloads show that on tial because the broadcast bottleneck can be sidestepped. average 67% (and up to 94%) of broadcasts are unneces- Many memory requests simply do not need to be broadcast sary. to the entire system, either because the data is not currently Coarse-Grain Coherence Tracking is a new technique shared, the request is an instruction fetch, the request that supplements a conventional coherence mechanism writes modified data back to memory, or the request is for and optimizes the performance of coherence enforcement. non-cacheable I/O data. The Coarse-Grain Coherence mechanism monitors the coherence status of large regions of memory, and uses that 1.1. Coarse-Grain Coherence Tracking information to avoid unnecessary broadcasts. Coarse- Grain Coherence Tracking is shown to eliminate 55-97% In this paper, we leverage the decoupling of the coherence of the unnecessary broadcasts, and improve performance and data transfer mechanisms by developing Coarse-Grain by 8.8% on average (and up to 21.7%). Coherence Tracking, a new technique that allows a processor to increase substantially the number of requests that can be sent directly to memory without a broadcast and without violating coherence. Coarse-Grain Coherence 1. Introduction Tracking can be implemented in an otherwise conventional multiprocessor system. A conventional cache coherence protocol (e.g., write-invalidate MOESI [5]) is employed to Cache-coherent multiprocessor systems have wide-ranging maintain coherence over the processors’ caches. However, applications from commercial transaction processing and unlike a conventional system each processor maintains a database services to large-scale scientific computing. They second structure for monitoring coherence at a granularity have become a critical component of internet-based ser- larger than a single cache line (Figure 1). This structure is vices in general. As system architectures have incorporated called the region coherence array (RCA), and maintains larger numbers of faster processors, the memory system coarse-grain coherence state over large, aligned memory has become critical to overall system performance and regions, where a region encompasses a power-of-two scalability. Improving both coherence and data bandwidth, number of conventional cache lines. and using them more efficiently, have become key design issues. On snoop requests, each processor’s RCA is snooped along with the cache line state, and the coarse-grain state is To maintain coherence and exploit fast cache-to-cache piggybacked onto the conventional snoop response. The transfers, multiprocessors commonly broadcast memory requesting processor stores this information in it’s RCA to requests to all the other processors in the system [1, 2, 3]. avoid broadcasting subsequent requests for lines in the While broadcasting is a quick and simple way to find region. As long as no other processors are caching data in cached data copies, locate the appropriate memory control- that region, requests for data in the region can go directly lers, and order memory requests, it consumes considerable to memory and do not require a broadcast. interconnect bandwidth and, as a byproduct, increases latency for non-shared data. To reduce the bottleneck Request / Response Network Cache Block (DCB) operations that invalidate, flush, or zero-out cached copies in the system. Most of these are Data Network Data Cache Block Zero (DCBZ) operations used by the AIX operating system to initialize physical pages. Region L2 Tag Data Data Coherence (Line L2 Cache Data Switch 94% 93% Array State) Network 100% Interface 89% 90% Data Data 76% 80% 67% Requests 60% 54% Requests 60% DRAM 32% Core Memory 40% Controller Requests 15% Processor Chip 20% 0% Figure 1. Processor node modified to implement e 9 H B W an ate Coarse-Grain Coherence. A Region Coherence e rac r b9 C- C- C- yt P P P Mean Oc Barnes we T T T c Array is added, and the network interface may Ra C meti need modification to send requests directly to the SPE h ECint2000SPECjbb2000 P Arit memory controller. S DCB I-Fetch Read Write Write-back As an example, consider a shared-memory multiprocessor with two-levels of cache in each processor. One of the Figure 2. Unnecessary broadcasts in a four- processors, processor A, performs a load operation. The processor system. From 15% to 94% of requests load misses in the L1 cache, and a read request is sent to could have been handled without a broadcast. the L2 cache. The L2 cache coherence state and the region coherence state are read in parallel to determine the status If a significant number of the unnecessary broadcasts can of the line. There is a miss in the L2 cache, and the region be eliminated in practice, there will be large reductions in state is invalid, so the request is broadcast. Each other traffic over the broadcast interconnect mechanism. This processor’s cache is snooped, and the external status of the will reduce overall bandwidth requirements, queuing de- region is sent back to processor A with the conventional lays, and cache tag lookups. Memory latency can be snoop response. Because no processors were caching data reduced because many data requests will be sent directly to from the region, an entry for the region is allocated in memory, without first going to an arbitration point and processor A’s RCA with an exclusive state for the region. broadcasting to all coherence agents. Some requests that Until another processor makes a request for a cache line in do not require a data transfer, such as requests to upgrade a that region, processor A can access any memory location shared copy to a modifiable state and DCB operations, can in the region without a broadcast. be completed immediately without an external request. As we will show, Coarse-Grain Coherence Tracking 1.2. Performance Potential does, in fact, eliminate many of the unnecessary broadcasts and provides the benefits just described. In effect, it en- Figure 2 illustrates the potential of Coarse-Grain Coher- ables a broadcast-based system to achieve much of the ence Tracking. The graph shows the percentage of benefit of a directory-based system (low latency access to unnecessary broadcast requests for a set of workloads on a non-shared data, lower interconnect traffic, and improved simulated four-processor PowerPC system. Refer to Sec- scalability) without the disadvantage of three-hop cache- tion 4 for the system parameters and a description of the to-cache transfers. It exploits spatial locality, but maintains workloads used. On average, 67% of the requests could a conventional line size to avoid increasing false-sharing, have been handled without a broadcast if the processor had fragmentation, and transfer costs. oracle knowledge of the coherence state of other caches in the system. The largest contribution is from ordinary reads 1.3. Paper Overview and writes (including prefetches) for data that is not shared at the time of the request. The next most significant con- This paper presents a protocol to implement Coarse-Grain tributor is write-backs, which generally do not need to be Coherence Tracking and provides simulation results for a seen by other processors. These are followed by instruc- broadcast-based multiprocessor system running commer- tion fetches, for which the data is usually clean-shared. cial, scientific, and multiprogrammed workloads. The The smallest contributor, although still significant, is Data paper is organized as follows. Related work is surveyed in the next section. Section 3 discusses the implementation of and are limited to virtual-page-sized regions of mem- Coarse-Grain Coherence Tracking, along with a proposed ory [14]. coherence protocol. Section 4 describes evaluation meth- Moshovos et al. proposed Jetty, a snoop-filtering odology, which is followed by simulation results in mechanism for reducing cache tag lookups [15]. This Section 5. Section 6 concludes the paper and presents technique is aimed at saving power by predicting whether opportunities for future work. an external snoop request is likely to hit in the local cache, avoiding unnecessary power-consuming cache tag look- 2. Related Work ups. Like our work, Jetty can reduce the overhead of maintaining coherence; however Jetty does not avoid send- Sectored caches provide management of cache data at ing requests and does not reduce request latency. granularities larger than a single line. Sectored caches Moshovos has concurrently proposed a technique called reduce tag overhead by allowing a number of contiguous RegionScout that is based on Jetty, and avoids sending lines to share the same tag [6, 7] (see also the references snoop requests as well as avoiding tag lookups for incom- in [8]). However, the partitioning of a cache into sectors ing snoops [16]. RegionScout uses less precise can increase the miss rate significantly for some applica- information, and hence can be implemented with less tions because of increased internal fragmentation [7, 8, 9]. storage overhead and complexity than our technique, but at There have been proposals to fix this problem, including the cost of effectiveness.

Improving Multiprocessor Performance with Coarse-Grain Coherence Tracking

Shared Memory Multiprocessors

Introduction to Multi-Threading and Vectorization Matti Kortelainen Larsoft Workshop 2019 25 June 2019 Outline

Detecting False Sharing Efficiently and Effectively

Decompose and Conquer: Addressing Evasive Errors in Systems on Chip

A Design Methodology for Soft-Core Platforms on FPGA with SMP Linux

The POWER4 Processor Introduction and Tuning Guide

Multi-Threading for Multi-Core Architectures

Intel Hyper-Threading Technology

PREDATOR: Predictive False Sharing Detection

Embedded Multicore: an Introduction

AMC: Advanced Multi-Accelerator Controller

Coherent Shared Memories for Fpgas by David Woods a Thesis