A Survey of Cache Coherence Schemes for Multiprocessors

Per Stenstrom, Lund University

Coherence schemes tackle the problem of maintaining data consistency in shared-memory multiprocessors. They rely on software, hardware, or a combination of both.

Shared-memory multiprocessors have emerged as an especially cost-effective way to provide increased computing power and speed, mainly because they use low-cost microprocessors economically interconnected with shared memory modules.

Figure 1 shows a shared-memory multiprocessor consisting of processors connected with the memory modules by an interconnection network. This system organization has three problems1:

(1) Memory contention. Since a memory module can handle only one memory request at a time, several requests from different processors will be serialized.
(2) Communication contention. Contention for individual links in the interconnection network can result even if requests are directed to different memory modules.
(3) Latency time. Multiprocessors with a large number of processors tend to have complex interconnection networks. The latency time for such networks (that is, the time a memory request takes to traverse the network) is long.

These problems all contribute to increased memory access times and hence slow down the processors' execution speeds.

Cache memories have served as an important way to reduce the average memory access time in uniprocessors. The locality of memory references over time (temporal locality) and space (spatial locality) allows the cache to perform a vast majority of all memory requests (typically more than 95 percent); memory handles only a small fraction. It is therefore not surprising that multiprocessor architects have also employed cache techniques to address the problems pointed out above. Figure 2 shows a multiprocessor organization with caches attached to all processors. This cache organization is often called private, as opposed to shared, because each cache is private to one or a few of the total number of processors.

The private cache organization appears in a number of multiprocessors, including Encore's Multimax, Sequent Computer Systems' Symmetry, and Digital Equipment's Firefly multiprocessor workstation. These systems use a common bus as the interconnection network. Communication contention therefore becomes a primary concern, and the cache serves mainly to reduce bus contention.

Other systems worth mentioning are RP3 from IBM, Cedar from the University of Illinois at Urbana-Champaign, and Butterfly from BBN Laboratories. These systems contain about 100 processors connected to the memory modules by a multistage interconnection network with a considerable latency. RP3 and Cedar also use caches to reduce the average memory access time.

Shared-memory multiprocessors have an advantage: the simplicity of sharing code and data structures among the processes comprising the parallel application. Interprocess communication, for instance, can be implemented by exchanging information through shared variables. This sharing can result in several copies of a shared block in one or more caches at the same time. To maintain a coherent view of the memory, these copies must be consistent. This is the cache coherence problem, or the cache consistency problem.

A large number of solutions to this problem have been proposed. This article surveys schemes for cache coherence. These schemes exhibit various degrees of hardware complexity, ranging from protocols that maintain coherence in hardware to software policies that prevent the existence of copies of shared, writable data.

First we'll look at some examples of how shared data is used. These examples help point out a number of performance issues. Then we'll look at hardware protocols. We'll see that consistency can be maintained efficiently, but in some cases with considerable hardware complexity, especially for multiprocessors with many processors. We'll investigate software schemes as an alternative capable of reducing the hardware cost.

Figure 1. An example of a shared-memory multiprocessor.1

Examples of algorithms with data sharing

Cache coherence poses a problem mainly for shared, read-write data structures. Read-only data structures (such as shared code) can be safely replicated without cache coherence enforcement mechanisms. Private, read-write data structures might impose a cache coherence problem if we allow processes to migrate from one processor to another. Many commercial multiprocessors help increase throughput for multiuser operating systems where user processes execute independently with no (or little) data sharing. In this case, we need to efficiently maintain consistency for private, read-write data in the context of process migration.

Figure 2. An example of a multiprocessor with private caches.1

We will concentrate on the behavior of cache coherence schemes for parallel applications using shared, read-write data structures. To understand how the schemes work and how they perform for different uses of shared data structures, we will investigate two parallel applications that use shared data structures differently. These examples highlight a number of performance issues.

We can find the first example - the well-known bounded-buffer producer and consumer problem - in any ordinary text on operating systems. Figure 3 shows it in a Pascal-like notation. The producer inserts a data item in the shared buffer if the buffer is not full. The buffer can store N items. The consumer removes an item if the buffer is not empty. We can choose the number of producers and consumers arbitrarily. The buffer is managed by a shared array, which implements the buffer, and three shared variables: in, out, and count, which keep track of the next item and the number of items stored in the buffer. Semaphores (implemented by mutexbegin and mutexend) protect buffer operations, which means that at most one process can enter the critical section at a time.

Producer:
  if count < N then
    mutexbegin
      buffer[in] := item;
      in := (in + 1) mod N;
      count := count + 1;
    mutexend

Consumer:
  if count <> 0 then
    mutexbegin
      item := buffer[out];
      out := (out + 1) mod N;
      count := count - 1;
    mutexend

Figure 3. Pascal-like code for the bounded-buffer problem.
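For readers who want to run the bounded-buffer example, the following is a minimal C sketch of the same logic using POSIX threads. It is an illustration only, not part of the original figure: the pthread mutex plays the role of mutexbegin/mutexend, and the names bb_put, bb_get, and BB_SIZE are invented for this sketch.

/* Minimal C/pthreads sketch of the bounded-buffer example (Figure 3).
 * Names are illustrative; the operations simply fail when the buffer is
 * full or empty rather than blocking. */
#include <pthread.h>
#include <stdio.h>

#define BB_SIZE 8                        /* N: capacity of the shared buffer */

static int buffer[BB_SIZE];              /* shared array implementing the buffer */
static int in = 0, out = 0, count = 0;   /* shared index and count variables */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Producer: insert an item if the buffer is not full; returns 1 on success. */
int bb_put(int item)
{
    int done = 0;
    pthread_mutex_lock(&lock);           /* mutexbegin */
    if (count < BB_SIZE) {
        buffer[in] = item;
        in = (in + 1) % BB_SIZE;
        count = count + 1;
        done = 1;
    }
    pthread_mutex_unlock(&lock);         /* mutexend */
    return done;
}

/* Consumer: remove an item if the buffer is not empty; returns 1 on success. */
int bb_get(int *item)
{
    int done = 0;
    pthread_mutex_lock(&lock);           /* mutexbegin */
    if (count != 0) {
        *item = buffer[out];
        out = (out + 1) % BB_SIZE;
        count = count - 1;
        done = 1;
    }
    pthread_mutex_unlock(&lock);         /* mutexend */
    return done;
}

int main(void)
{
    int item;
    bb_put(42);
    if (bb_get(&item))
        printf("consumed %d\n", item);
    return 0;
}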

The second example to consider is a parallel algorithm for solving a linear system of equations by iteration. It takes the form

x(i+1) = A x(i) + b

where x(i+1), x(i), and b are vectors of size N, and A is a matrix of size N x N. Suppose that each iteration (the calculation of vector x(i+1)) is performed by N processes, where each process calculates one vector element. The code for this algorithm appears in Figure 4. The termination condition does not concern us here; therefore, we assume that it never terminates.

repeat
  par-for J := 1 to N do
  begin
    xtemp[J] := b[J];
    for K := 1 to N do
      xtemp[J] := xtemp[J] + A[J,K] * x[K];
  end;
  barrier-sync;
  par-for J := 1 to N do
    x[J] := xtemp[J];
  barrier-sync;
until false;

Figure 4. Pascal-like code for the parallel algorithm for solving a linear system of equations by iteration.

The par-for statement initiates N processes. Each process calculates a new value, which is stored in xtemp. The last parallel loop in the iteration copies back the elements of xtemp to vector x. This requires a barrier synchronization. The most important observations are

(1) Vector b and matrix A are read-shared and can be safely cached.
(2) All elements of vector x are read to calculate a new vector element.
(3) All elements of vector x are updated in each iteration.

With these examples in mind, we will consider how the proposed schemes for cache coherence manage copies of the data structures. Proposed solutions range from hardware-implemented cache consistency protocols, which give software a coherent view of the memory system, to schemes providing varied hardware support but with cache coherence enforcement policies implemented in software. We will focus on the implementation cost and performance issues of the surveyed schemes.

Hardware-based protocols

Hardware-based protocols include snoopy cache protocols, directory schemes, and cache-coherent network architectures. They all rely on a certain cache coherence policy, so let's start by looking at the different policies.

Cache coherence policies. Hardware-based protocols for maintaining cache coherence guarantee memory system coherence without software-implemented mechanisms. Typically, hardware mechanisms detect inconsistency conditions and perform actions according to a hardware-implemented protocol.

Data is decomposed into a number of equally sized blocks. A block is the unit of transfer between memory and caches. Hardware protocols allow an arbitrary number of copies of a block to exist at the same time. There are two policies for maintaining cache consistency: write-invalidate and write-update.

The write-invalidate policy maintains consistency of multiple copies in the following way: Read requests are carried out locally if a copy of the block exists. When a processor updates a block, however, all other copies are invalidated. How this is done depends on the interconnection network used. (Ignore it for the moment.) A subsequent update by the same processor can then be performed locally in the cache, since other copies no longer exist. Figure 5 shows how this policy works. In Figure 5a, four copies of block X are present in the system (the memory copy and three cached copies). In Figure 5b, processor 1 has updated an item in block X (the updated block is denoted X') and all other copies are invalidated (denoted I). If processor 2 issues a read request to an item in block X', then the cache attached to processor 1 supplies it.

The write-update policy maintains consistency differently. Instead of invalidating all copies, it updates them, as shown in Figure 5c. Whether the memory copy is updated or not depends on how this protocol is implemented. We will look at that later.

Consider the write-invalidate policy for the bounded-buffer problem, recalling the code in Figure 3. Suppose a producer process P and a consumer process C, executing on different physical processors, alternately enter the critical section in the following way: P enters the critical section K times in a row, then C enters the critical section K times in a row, and so forth. If K=1, then count will be read and written by P and C, then P again, and so on. This means there will be a miss on the read, then an invalidation on the write. Referred to as the ping-pong effect, this means data migrates back and forth between the caches, resulting in heavy network traffic. However, if the producer process inserts consecutive items in the buffer - that is, if K>1 - then the reads and writes to count will be local. The same holds for the consumer process.
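To make the difference between the two policies concrete, here is a small C sketch that models one block replicated in several caches and applies either policy on a write. It is an illustration of the idea only, not the article's hardware design; the state names, NCACHES, and the function names are assumptions made for this sketch.

/* Toy model of the two coherence policies for one block cached by NCACHES
 * processors. Illustrative only. */
#include <stdio.h>

#define NCACHES 4

enum state { INVALID, VALID };            /* per-cache state of the block   */

struct copy { enum state st; int data; };

static struct copy cache[NCACHES];        /* one cached copy per processor  */
static int memory_copy = 0;               /* the memory copy of the block   */

/* Write-invalidate: the writer keeps the only valid copy; others drop theirs. */
static void write_invalidate(int writer, int value)
{
    for (int i = 0; i < NCACHES; i++)
        if (i != writer) cache[i].st = INVALID;   /* consistency command */
    cache[writer].st = VALID;
    cache[writer].data = value;                   /* local update */
}

/* Write-update: the writer propagates the new value to every other copy
 * (and, in this sketch, to memory as well). */
static void write_update(int writer, int value)
{
    (void)writer;
    for (int i = 0; i < NCACHES; i++) {
        cache[i].st = VALID;
        cache[i].data = value;                    /* consistency command */
    }
    memory_copy = value;
}

int main(void)
{
    for (int i = 0; i < NCACHES; i++) cache[i] = (struct copy){ VALID, 0 };

    write_invalidate(1, 42);   /* Figure 5b: only processor 1 keeps a copy */
    printf("after write-invalidate: cache 2 state = %d (0 = invalid)\n", cache[2].st);

    write_update(1, 43);       /* Figure 5c: every copy now holds 43       */
    printf("after write-update:     cache 2 data  = %d\n", cache[2].data);
    return 0;
}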

Now consider the write-update policy applied to the bounded-buffer problem. Here, note that the write to count generates a global update independent of the order of execution of P and C. Table 1 shows the communication cost associated with accesses to the variable count for K consecutive executions of the critical section. Under the assumption that the communication cost is the same for an invalidation as for an update, and that the communication cost for a miss is twice that for an invalidation, the break-even point of the communication cost between the two policies is K=3.

Now consider the iterative algorithm of Figure 4 and the write-invalidate protocol. Suppose the block size is exactly one vector element and the cache is infinitely large. Observe first that accesses to matrix A and vector b will be local, since they are read-shared and will not be invalidated. However, each process will experience a read miss on every access to vector x, since all elements of x are updated (that is, all copies are invalidated) in each iteration. Each process updates exactly one element of x and thus generates exactly one invalidation, which removes the copies of that element from all other caches. Per iteration, then, each process suffers N read misses and the N processes together generate N invalidations. If we instead consider the write-update policy, then all reads will be local but N global updates will be generated for each process. These observations are summarized in Table 1. Write-update performs better in terms of communication cost than does write-invalidate for this algorithm, with the same assumptions as for the bounded-buffer problem.

Table 1. Comparison of the number of consistency actions generated by the cache coherence policies for the example algorithms.

                    Communication     Bounded-Buffer    Iterative
                    Cost              Problem           Algorithm
Write-invalidate    Invalidations     1                 N
                    Misses            1                 N
Write-update        Updates           K                 N

The write-invalidate and write-update policies require that invalidation and update commands (collectively referred to as consistency commands) be sent to at least those caches having copies of the block. Until now we have not considered the implications of this for different networks. In some networks (such as buses), it is feasible to broadcast consistency commands to all caches. This means that every cache must process every consistency command to find out whether it refers to data in the cache. These protocols are called snoopy cache protocols because each cache "snoops" on the network for every incoming consistency command.

In other networks (such as multistage networks), the network traffic generated by broadcasts is prohibitive. Such systems prefer to multicast consistency commands exactly to those caches having a copy of the block. This requires bookkeeping by means of a directory that tracks all copies of blocks. Hence, these protocols are called directory schemes.

Figure 5. (a) Memory and three processor caches store consistent copies of block X. (b) All copies except the one stored in processor 1's cache are invalidated (I) when processor 1 updates X (denoted X') if the write-invalidate policy is used. (c) All copies (except the memory copy, which is ignored) are updated if the write-update policy is used.

First we'll look at different implementations of snoopy cache protocols. Then we'll look at directory schemes. While snoopy cache protocols rely on the use of buses, directory schemes can be used for general interconnection networks. Past work has also yielded proposals for cache-coherent network architectures supporting a large number of processors.

Write-invalidate snoopy cache protocols. Historically, Goodman proposed the first write-invalidate snoopy cache protocol, called write-once and reviewed by Archibald and Baer.2 To understand the hardware complexity of the reviewed protocols, and certain possible optimizations, we will take a rather detailed look at this protocol.

The write-once protocol associates a state with each cached copy of a block. Possible states for a copy are

- Invalid. The copy is inconsistent.
- Valid. There exists a valid copy consistent with the memory copy.
- Reserved. Data has been written exactly once and the copy is consistent with the memory copy, which is the only other copy.
- Dirty. Data has been modified more than once and the copy is the only one in the system.

Write-once uses a copy-back memory update policy, which means that the entire copy of the block must be written back to memory when it is replaced, provided that it has been modified during its cache residence time (that is, the state is dirty). To maintain consistency, the protocol requires the following consistency commands besides the normal memory read block (Read-Blk) and write block (Write-Blk) commands:

- Write-Inv. Invalidates all other copies of a block.
- Read-Inv. Reads a block and invalidates all other copies.

State transitions result either from the local processor read and write commands (P-Read and P-Write) or from the consistency commands (Read-Blk, Write-Blk, Write-Inv, and Read-Inv) incoming from the global bus. Figure 6 shows a state-transition graph summarizing the actions taken by the write-once protocol. Solid lines mark processor-initiated actions, and dashed lines mark consistency actions initiated by other caches and sent over the bus.

Figure 6. State-transition graph for states of cached copies for the write-once protocol. Solid lines mark processor-initiated actions, and dashed lines mark consistency actions initiated by other caches.

The operation of the protocol can also be specified by making clear the actions taken on processor reads and writes. Read hits can always be performed locally in the cache and do not result in state transitions. For read misses, write hits, and write misses the actions occur as follows:

- Read miss. If no dirty copy exists, then memory has a consistent copy and supplies a copy to the cache. This copy will be in the valid state. If a dirty copy exists, then the corresponding cache inhibits memory and sends a copy to the requesting cache. Both copies change to valid and the memory is updated.
- Write hit. If the copy is in the dirty or reserved state, then the write can be carried out locally and the new state is dirty. If the state is valid, then a Write-Inv consistency command is broadcast to all caches, invalidating their copies. The memory copy is updated and the resulting state is reserved.
- Write miss. The copy comes either from a cache with a dirty copy, which then updates memory, or from memory. This is accomplished by sending a Read-Inv consistency command, which invalidates all cached copies. The copy is updated locally and the resulting state is dirty.
- Replacement. If the copy is dirty, then it has to be written back to main memory. Otherwise, no actions are taken.

Other examples of proposed write-invalidate protocols include the Illinois protocol proposed by Papamarcos and Patel and the Berkeley protocol specifically designed for the SPUR (Symbolic Processing Using RISC) multiprocessor workstation at the University of California at Berkeley (both reviewed by Archibald and Baer2). They improve on the management of private data (Illinois) and take into account the discrepancy between the memory and cache access times to optimize cache-to-cache transfers (Berkeley).
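As a concrete illustration of the write-once state machine just described, here is a hedged C sketch of the per-block transitions for processor-initiated accesses. The enum and function names are mine, the printed "bus actions" are simplified, and the snooped bus side is reduced to a single helper; it is a sketch of the rules above, not Goodman's implementation.

/* Sketch of the write-once state machine for one cached block. */
#include <stdio.h>

enum wo_state { WO_INVALID, WO_VALID, WO_RESERVED, WO_DIRTY };

/* Processor read: a hit needs no transition; a miss loads a valid copy
 * (memory or the dirty cache supplies it, and memory is updated). */
enum wo_state wo_on_read(enum wo_state s)
{
    if (s == WO_INVALID) {
        puts("Read-Blk on bus: fetch copy, state -> valid");
        return WO_VALID;
    }
    return s;                        /* read hit: no state change */
}

/* Processor write. */
enum wo_state wo_on_write(enum wo_state s)
{
    switch (s) {
    case WO_DIRTY:
    case WO_RESERVED:
        return WO_DIRTY;             /* write hit, carried out locally */
    case WO_VALID:
        puts("Write-Inv on bus: invalidate other copies, update memory");
        return WO_RESERVED;          /* block written exactly once */
    case WO_INVALID:
    default:
        puts("Read-Inv on bus: fetch block, invalidate other copies");
        return WO_DIRTY;             /* write miss */
    }
}

/* Snooped consistency command (Write-Inv or Read-Inv) from another cache. */
enum wo_state wo_on_snooped_invalidate(enum wo_state s)
{
    (void)s;
    return WO_INVALID;
}

int main(void)
{
    enum wo_state s = WO_INVALID;
    s = wo_on_read(s);               /* read miss  -> valid    */
    s = wo_on_write(s);              /* write hit  -> reserved */
    s = wo_on_write(s);              /* write hit  -> dirty    */
    s = wo_on_snooped_invalidate(s); /* another cache writes   */
    printf("final state: %d (0 = invalid)\n", s);
    return 0;
}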

Write-update snoopy cache protocols. An example of a write-update protocol, the Firefly protocol, has been implemented in the Firefly multiprocessor workstation from Digital Equipment (reviewed by Archibald and Baer2). It associates three possible states with a cached copy of a block:

- Valid-exclusive. The only cached copy, it is consistent with the memory copy.
- Shared. The copy is consistent, and there are other consistent copies.
- Dirty. This is the only copy. The memory copy is inconsistent.

The Firefly protocol uses a copy-back update policy for private blocks and write-through for shared blocks. The notion of shared and private is determined at runtime. To maintain consistency, a write-update consistency command updates all copies. A dedicated bus line, denoted the "shared line," is used by the snooping mechanisms to tell the writer that copies exist. Figure 7 summarizes the state transitions. The actions taken on a processor read or write follow:

- Read miss. If there are shared copies, then these caches supply the block by synchronizing the transmission on the bus. If a dirty copy exists, then this cache supplies the copy and updates main memory. The new state in these cases is shared. If there is no cached copy, then memory supplies the copy and the new state is valid-exclusive.
- Write hit. If the block is dirty or valid-exclusive, then the write can be carried out locally and the resulting state is dirty. If the copy is shared, all other copies (including the memory copy) are updated. If sharing has ceased (indicated by the shared line), then the next state is valid-exclusive.
- Write miss. The copy is supplied either from other caches or from memory. If it comes from memory, then its loaded-in state is dirty. Otherwise, all other copies (including the memory copy) are updated and the resulting state is shared.
- Replacement. If the state is dirty, then the copy is written back to main memory. Otherwise, no actions are taken.

Figure 7. State-transition graph for states of cached copies for the Firefly protocol.
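The role of the shared bus line in the Firefly write-hit rule can be sketched in a few lines of C. This is an illustrative model only, with names and structure of my own choosing, not the DEC implementation; the shared line is modeled as the result of snooping the other caches.

/* Illustrative C sketch of the Firefly write-hit rule. */
#include <stdio.h>
#include <stdbool.h>

enum ff_state { FF_INVALID, FF_VALID_EXCLUSIVE, FF_SHARED, FF_DIRTY };

/* True if any other cache still holds a copy (asserts the shared line). */
static bool snoop_shared_line(int n_other_copies)
{
    return n_other_copies > 0;
}

/* Write hit on a block in state s; *memory_updated reports a write-through. */
enum ff_state ff_write_hit(enum ff_state s, int n_other_copies, bool *memory_updated)
{
    *memory_updated = false;

    if (s == FF_DIRTY || s == FF_VALID_EXCLUSIVE)
        return FF_DIRTY;                   /* private block: local copy-back write */

    if (s == FF_SHARED) {
        /* Shared block: broadcast a write-update; memory is written through. */
        *memory_updated = true;
        if (!snoop_shared_line(n_other_copies))
            return FF_VALID_EXCLUSIVE;     /* sharing has ceased */
        return FF_SHARED;
    }
    return s;                              /* invalid: a miss, handled elsewhere */
}

int main(void)
{
    bool mem;
    enum ff_state s = ff_write_hit(FF_SHARED, 0, &mem);
    printf("shared block, no other copies left: state=%d, memory updated=%d\n", s, mem);
    s = ff_write_hit(FF_VALID_EXCLUSIVE, 0, &mem);
    printf("valid-exclusive block:              state=%d (3 = dirty)\n", s);
    return 0;
}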

Another write-update protocol, the Dragon protocol (reviewed by Archibald and Baer2), has been proposed for the Dragon multiprocessor workstation from Xerox PARC. To improve the efficiency of cache-to-cache transfers, it avoids updating memory until a block is replaced.

Implementation and performance issues for snoopy cache protocols. Snoopy cache protocols are extremely popular because of the ease of implementation. Many commercial, bus-based multiprocessors have used the protocols we have investigated here.1 For example, Sequent Computer Systems' Symmetry multiprocessor and Alliant Computer Systems' Alliant FX use write-invalidate policies to maintain cache consistency. The DEC Firefly uses the write-update policy, as does the experimental Dragon workstation mentioned above.

The main differences between a snoopy cache and a uniprocessor cache are the cache controller, the information stored in the cache directory, and the bus controller. The cache controller is a finite-state machine that implements the cache coherence protocol according to the state-transition graphs of Figures 6 and 7. The cache directory needs to store the state for each block; only two bits are needed for the protocols we have reviewed. The bus controller implements the bus-snooping mechanisms, which must monitor every bus operation to discover whether an action is needed. Since the snooping mechanism must have access to the directory, contention for the directory can arise between local requests and requests coming in from the bus. For that reason the directory is often duplicated.

Another implementation issue concerns the bus design. To efficiently support the protocols reviewed here, certain bus lines are needed; one example we have seen is the shared line that supports write-update policies. Therefore, dedicated bus standards have been proposed, such as the IEEE Futurebus (IEEE standard P896.1).

Now let's discuss the impact of certain cache parameters on the performance of snoopy cache protocols. We would use snoopy cache protocols mainly to reduce bus traffic, with a secondary goal of reducing the average memory access time. An important question is how these metrics are affected by the block (line) size when using a write-invalidate protocol.

For uniprocessor caches, bus traffic and average access time mainly result from cache misses, that is, references to data that are not cache resident. Uniprocessor cache studies have demonstrated that the miss ratio decreases when the block size increases. This results from the spatial locality of code, in particular, and of data. The miss ratio decreases until the block size reaches a certain point - the data pollution point - and then it starts to increase. For larger caches the data pollution point appears at a larger block size.

Bus traffic per reference (in number of bus cycles) is proportional both to the miss ratio, M, and to the number of words that must be transferred to serve a cache miss. If this number matches the block size, L, then average bus traffic per reference is the product of M and L. Hence, even if the miss ratio decreases when the block size increases, bus traffic will not necessarily decrease. In fact, simulations have shown that bus traffic increases with block size for data references,3 which suggests using a small block size in bus-based multiprocessors.

For write-invalidate protocols, a cache miss can also result from an invalidation initiated by another processor prior to the cache access - an invalidation miss. Such misses increase bus traffic. Note that increasing the cache size cannot reduce invalidation misses. Eggers and Katz4 have done extensive simulations based on parallel program traces (trace-driven simulation) to investigate the impact of block size on the miss ratio and bus traffic (see also their references to earlier work). One of their conclusions is that the total miss ratio generally exceeds that in uniprocessors. Moreover, it does not necessarily decrease when the block size increases, unlike uniprocessor cache behavior. This means that bus traffic in multiprocessors may increase dramatically when the block size increases.

We can explain the main results by using the example algorithms from Figures 3 and 4. Consider the bounded-buffer problem and the use of the shared array, buffer. If the line size matches the size of each item, then the consumer will experience an invalidation miss on every access, assuming that producers and consumers access the critical section alternately. Note that if the block size increases, the invalidation miss ratio remains the same but bus traffic increases. However, with a larger block size, consumers could benefit from a decreased miss ratio if the same consumer process accessed the critical section several times in a row.

For the iterative algorithm from Figure 4, increasing the block size reduces the miss ratio for accesses to vector x, since all elements of the block are accessed once. Accesses to vector xtemp, however, experience a higher miss ratio because each write to xtemp invalidates all copies. This means that, in the worst case, a read miss for xtemp results for each iteration of the inner loop.

Even if the spatial locality with respect to a process is high, this does not necessarily suggest a large block size. It depends on the effect of accesses by all processes sharing the block. For shared data usage where data are exclusively accessed by one process for a considerable amount of time, increasing the block size may reduce the invalidation miss ratio.

For write-update protocols, the block size is not an issue because misses are not caused by consistency-related actions. Moreover, the frequency of global updates does not depend on the block size. A potential problem, however, is that write-update protocols tend to update copies even if they are not actively used. Note that a copy remains in the cache until replaced, since write-update protocols never invalidate copies. This effect is more pronounced for large caches, which help multiprocessors reduce the miss ratio and the resulting bus traffic.

An important performance issue for write-invalidate policies concerns reducing the number of invalidation misses. For write-update policies, an important issue concerns reducing the sharing of data to lessen bus traffic. Now let's survey some extensions to the two types of protocols that address these issues.

Snoopy cache protocol extensions. The write-invalidate protocol may lead to heavy bus traffic caused by read misses resulting from iterations where one processor updates a variable and a number of processors read the same variable. This happens with the iterative algorithm shown in Figure 4. The number of read misses could be reduced considerably if, upon a read miss, the copy were distributed to all caches with invalid copies. In that case, all N read misses per iteration and per process could be eliminated for all processes but one. Such an extension to the write-invalidate protocol, called read-broadcast, was proposed by Rudolph and Segall.5

As noted for the write-update protocol, data items might be updated even if they are never accessed by other processors. This could happen with the bounded-buffer problem of Figure 3 if a consumer process migrates from one processor to another. In this case, parts of the buffer might remain in the old cache until replaced, which can take a very long time if the cache is large. While in the cache, the buffer generates heavy network traffic because of the broadcast updates.

One approach measures the break-even point at which the communication cost (in terms of bus cycles for updating a block) exceeds the cost of handling an invalidation miss. Assuming that a miss costs twice as much as a global update, the break-even point appears when two consecutive updates have taken place with no intervening local accesses. We could implement this scheme by adding two cache states that determine when the break-even point is reached. Karlin et al.6 proposed and evaluated a number of such extensions, called competitive snooping, and Eggers and Katz evaluated the performance benefits of these extensions.4
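The two extra cache states mentioned above can be approximated with a small per-line counter, as in the following C sketch. The counter, the threshold UPDATE_LIMIT, and the names are assumptions made for illustration; this is the competitive-snooping idea in miniature, not Karlin et al.'s exact algorithm.

/* Sketch of competitive snooping: a cached copy tolerates a limited number of
 * consecutive remote updates with no intervening local access, then drops out. */
#include <stdio.h>
#include <stdbool.h>

#define UPDATE_LIMIT 2            /* break-even: two updates cost about one miss */

struct line {
    bool valid;
    int  data;
    int  remote_updates;          /* consecutive remote updates observed */
};

/* Local processor touches the line: the copy is clearly useful, so reset. */
void local_access(struct line *l)
{
    if (l->valid)
        l->remote_updates = 0;
}

/* Snooped write-update from another processor. */
void snooped_update(struct line *l, int new_data)
{
    if (!l->valid)
        return;
    if (++l->remote_updates >= UPDATE_LIMIT) {
        l->valid = false;         /* switch to invalidation: stop paying for updates */
        return;
    }
    l->data = new_data;           /* still cheaper to keep the copy up to date */
}

int main(void)
{
    struct line l = { true, 0, 0 };
    snooped_update(&l, 1);
    snooped_update(&l, 2);        /* second consecutive remote update: copy dropped */
    printf("valid after two remote updates with no local access: %d\n", l.valid);
    return 0;
}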

Directory schemes. We have seen that even using large caches cannot entirely eliminate bus traffic, because of the consistency actions introduced as a result of data sharing. This puts an upper limit on the number of processors that a bus can accommodate. For multiprocessors with a large number of processors - say, 100 - we must use other interconnection networks, such as multistage networks.

Snoopy cache protocols do not suit general interconnection networks, mainly because broadcasting reduces their performance to that of a bus. Instead, consistency commands should be sent only to those caches that have a copy of the block. Doing so requires storing exact information about which caches have copies of each cached block.

We will survey different approaches proposed in the literature. Note that this issue can be considered orthogonal to the choice of cache coherence policy; either write-invalidate or write-update would serve. Cache coherence protocols that somehow store information on where copies of blocks reside are called directory schemes. Agarwal et al. surveyed and evaluated directory schemes in their work.7

The proposed schemes differ mainly in how the directory maintains the information and which information the directory stores. Tang proposed the first directory scheme (reviewed by Agarwal et al.7). He suggested a central directory containing duplicates of all cache directories. The information stored in the cache directory depends on the coherence policy employed. The main point is that the directory controller can find out which caches have a copy of a particular block by searching through all duplicates.

Censier and Feautrier proposed another organization of the directory (also reviewed by Agarwal et al.7). Associated with each memory block is a bit vector, called the presence flag vector. One bit for each cache indicates which caches have a copy of the block. Some status bits are needed, depending on the cache coherence policy used.

In an earlier work, I proposed a different way of storing the directory information.8 Instead of associating the state information and the presence flag vector with the memory copy, this information is associated with the cached copy. Let's call this the Stenstrom scheme.

First we compare the implementation cost in terms of the number of bits needed to store the information. Given M memory blocks, C cache lines, N caches, and B bits of state information for each cache block, Table 2 shows the overhead for each scheme.

Table 2. Bit overhead for proposed directory schemes.

Scheme       Overhead (No. of Bits)
Censier      M(B + N)
Stenstrom    C(B + N) + M log2 N

From Table 2 we see that the Tang scheme has the least overhead, provided that C < M. However, this scheme has two major disadvantages. First, the directory is centralized, which can introduce severe contention. Second, the directory controller must search through all duplicates to find which caches have copies of a block.

In the other schemes, state information is distributed over memory or cache modules, which reduces contention. Furthermore, for both schemes the presence flag vector stores the residency of copies, eliminating the need for the search associated with the Tang scheme. This simplification does not come for free. In the Censier scheme, overhead is proportional to memory size; in the Stenstrom scheme, it is proportional to cache size. The latter scheme also needs the identity of the current owner in memory, which requires an additional log2 N bits per memory block. The bit overhead for both schemes is prohibitive for multiprocessors with a large number of processors because of the size of the presence flag vector.

To get an insight into the reduction of network traffic over snoopy cache protocols, assume that the directory organizations presented above support the write-invalidate cache coherence policy. Since the Tang and Censier schemes differ only in the directory implementation, we will consider only the Censier and Stenstrom schemes.

Let's concentrate on how read misses and write hits are handled; previous protocol descriptions have already shown how other actions are handled. In the following, assume the system contains exactly one dirty copy. Figure 8 shows the control flow of consistency actions.

Figure 8. Actions taken on a read miss (thin lines) and a write hit (bold lines) for the write-invalidate implementation of (a) the Censier scheme and (b) the Stenstrom scheme.

In the Censier scheme, a read miss at cache 2 results in a request sent to the memory module. The memory controller retransmits the request to the dirty cache, which writes back its copy. The memory module can then supply a copy to the requesting cache. These actions appear in Figure 8a as thin lines. If a write hit is generated at cache 1, then a command is sent to the memory controller, which sends invalidations to all caches marked in the presence flag vector (cache 2 in Figure 8a). Bold lines mark these actions in Figure 8a.

In the Stenstrom scheme, a read miss at cache 2 also results in a request sent to the memory module, and the memory controller retransmits the request to the dirty cache. Instead of writing back its copy, however, the cache supplies the copy directly to cache 2. These actions appear in Figure 8b as thin lines. If a write hit is generated at cache 1, then invalidation commands are sent directly to all caches marked in the presence flag vector (cache 2 in Figure 8b). Bold lines mark these actions in Figure 8b.
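A Censier-style full-map directory entry, and the multicast it enables on a write hit, can be sketched as follows. This is a simplified illustration: the bit-vector width, the dirty flag layout, and the names are my assumptions, not the original proposal's hardware format.

/* Sketch of a full-map directory entry: a presence flag vector with one bit
 * per cache plus a dirty bit. */
#include <stdint.h>
#include <stdio.h>

#define NCACHES 8                       /* up to 32 with a uint32_t vector */

struct dir_entry {
    uint32_t presence;                  /* bit i set: cache i holds a copy */
    int      dirty;                     /* block modified in exactly one cache */
};

/* Write hit at cache `writer`: multicast invalidations only to the caches
 * marked in the presence flag vector, then record the writer as sole owner. */
void directory_write_hit(struct dir_entry *e, int writer)
{
    for (int i = 0; i < NCACHES; i++) {
        if (i == writer) continue;
        if (e->presence & (1u << i))
            printf("send invalidation to cache %d\n", i);  /* consistency command */
    }
    e->presence = 1u << writer;         /* only the writer keeps a copy */
    e->dirty = 1;
}

/* Read miss at cache `reader`: if a dirty copy exists, the directory would
 * first forward the request to the owning cache; then mark the new copy. */
void directory_read_miss(struct dir_entry *e, int reader)
{
    if (e->dirty)
        printf("forward request to owning cache, write block back\n");
    e->presence |= 1u << reader;
    e->dirty = 0;
}

int main(void)
{
    struct dir_entry e = { 0x07, 0 };   /* caches 0, 1, 2 hold clean copies */
    directory_write_hit(&e, 1);         /* invalidates caches 0 and 2 */
    directory_read_miss(&e, 2);
    printf("presence vector: 0x%02x, dirty: %d\n", (unsigned)e.presence, e.dirty);
    return 0;
}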

Agarwal et al.7 introduced a classification of directory schemes. They referred to a directory scheme as Dir-i X, where i is the number of cache pointers for each block and X denotes whether the scheme broadcasts consistency commands (X = B) when the number of copies exceeds the number of cache pointers, or whether it disallows more than i copies (X = NB). Their terminology denotes the full-map schemes as Dir-N NB and the limited directory schemes with broadcast capability as Dir-i B, where i < N.

A limited directory scheme must decide what to do when more than i copies are requested. Two alternatives are possible: either disallow more than i copies or start to broadcast when more than i copies exist. Clearly the success of a limited directory scheme depends on the degree of sharing, that is, the number of processors that simultaneously share data.

Table 3. Cache pointer bit overhead for a full-map, limited, and chained directory scheme.

Scheme       Overhead (No. of Bits)
Limited      iM log2 N
Chained      M log2 N + C log2 N

Chained directory schemes link the copies of a block together through cache pointers rather than keeping a full map in memory. The Stenstrom scheme, however, sends consistency commands directly to other caches without having to traverse the chain of cache pointers.

The full-map directory schemes have the advantage of reducing the network traffic caused by invalidations or updates by multicasting them only to those caches with copies of a block. However, the amount of memory needed tends to be prohibitive for multiprocessors with many processors. Reducing the number of cache pointers, that is, employing limited directory schemes, alleviates this problem. The price for this is limiting the number of copies that may simultaneously coexist in different caches, or introducing peaks of network traffic due to broadcasting of consistency commands. Consequently, a trade-off exists between network traffic and directory memory overhead.
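The bit-overhead formulas in Tables 2 and 3 are easy to evaluate for a concrete configuration. The short C program below simply plugs in example values of M, C, N, B, and i so the relative costs can be compared; the parameter values and helper names are mine, chosen only for illustration.

/* Evaluate the directory bit-overhead formulas of Tables 2 and 3:
 * Censier M(B+N), Stenstrom C(B+N)+M log2 N, limited i*M*log2 N,
 * chained (M+C)*log2 N. Compile with -lm. */
#include <math.h>
#include <stdio.h>

static double lg2(double x) { return log(x) / log(2.0); }

int main(void)
{
    double M = 1 << 20;     /* memory blocks            */
    double C = 1 << 14;     /* cache lines per cache    */
    double N = 64;          /* caches (processors)      */
    double B = 2;           /* state bits per block     */
    double i = 4;           /* cache pointers, limited  */

    double censier   = M * (B + N);
    double stenstrom = C * (B + N) + M * lg2(N);
    double limited   = i * M * lg2(N);
    double chained   = M * lg2(N) + C * lg2(N);

    printf("Censier   : %.0f bits\n", censier);
    printf("Stenstrom : %.0f bits\n", stenstrom);
    printf("Limited   : %.0f bits\n", limited);
    printf("Chained   : %.0f bits\n", chained);
    return 0;
}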

Cache-coherent network architectures. Wilson proposed a hierarchical cache/bus architecture that extends the single-bus multiprocessor to a hierarchy of buses and caches: higher level caches hold copies of the blocks cached in the subsystems beneath them (Figure 9), and a write-invalidate protocol maintains consistency.10 A write to a shared block propagates up to the highest level and invalidates every copy. Consequently, copies in Mc20, Mc22, Mc16, and Mc18 will be invalidated. However, higher order caches such as Mc20 keep track of dirty blocks beneath them. A subsequent read request issued by a processor in another subsystem will propagate up the hierarchy because no copies exist; when it reaches the topmost level, Mc20 issues a flush request down to Mc11 and the dirty copy is supplied to the requesting processor's cache.

Note that higher level caches act as filters for consistency actions; an invalidation command or a read request will not propagate down to subsystems that don't contain a copy of the corresponding block. This means that Mc21 in the example above filters the invalidation sent on the topmost bus, since its subsystem has no copies.

The next architecture is the Wisconsin Multicube, proposed by Goodman and Woest.11 As shown in Figure 10, it consists of a grid of buses with a processing element at each switch and a memory module connected to each column bus. A processing element consists of a processor, a cache, and a snoopy cache controller connected to the row and column buses. The snoopy cache is large (comparable to the size of main memory in a uniprocessor) to reduce network traffic. The large caches mean that bus traffic results mainly from consistency actions.

As in the hierarchical cache/bus system, a write-invalidate protocol maintains consistency. Invalidations are broadcast on every row bus, while global read requests are routed to the closest cache with a copy of the requested block. This is supported in the following way: Each block has a "home column" corresponding to the memory module that stores the block. A block can be globally modified or unmodified; if the block is globally modified, then there exists only one copy. Each cache controller stores the identification of all modified blocks in its column, which serves as routing information for read requests. A read request is broadcast on the row bus and routed to the column bus where the modified block is stored.

Figure 10. Goodman and Woest's Wisconsin Multicube.
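The Multicube routing rule just described can be modeled in a few lines of C. The data structures and names below are my own assumptions (a boolean table standing in for each column's list of modified blocks, and a modulo mapping for the home column); the sketch captures only the routing decision, not the full protocol.

/* Illustrative sketch of read-request routing in a Multicube-like grid. */
#include <stdio.h>
#include <stdbool.h>

#define DIM      4            /* DIM x DIM processing elements */
#define NBLOCKS  64

/* modified_in_col[j][b] is true if block b is globally modified somewhere
 * in column j; every controller in column j keeps this information. */
static bool modified_in_col[DIM][NBLOCKS];

static int home_column(int block) { return block % DIM; }

/* A read miss at row `row`: the request is broadcast on the row bus; the
 * controller in the column that knows of a modified copy claims it,
 * otherwise the request goes down the block's home column to memory. */
int route_read_request(int row, int block)
{
    (void)row;                            /* the row bus reaches all columns */
    for (int j = 0; j < DIM; j++)
        if (modified_in_col[j][block])
            return j;                     /* fetch from the dirty cache's column */
    return home_column(block);            /* unmodified: fetch from memory */
}

int main(void)
{
    int block = 5;
    printf("unmodified block %d routed to column %d (home)\n",
           block, route_read_request(0, block));

    modified_in_col[3][block] = true;     /* a cache in column 3 dirtied it */
    printf("modified block %d routed to column %d\n",
           block, route_read_request(0, block));
    return 0;
}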

The Data Diffusion Machine12 is another hierarchical cache-coherent architecture, quite similar to Wilson's architecture. It consists of a hierarchy of buses with large processor caches (on the order of a megabyte) at the lowest level; these caches are the only type of memory in the system. A hierarchical write-invalidate protocol maintains consistency. Unlike Wilson's architecture, higher order caches (such as the level-2 caches in Figure 9) contain only state information, which considerably reduces memory overhead. We pay a price for this: certain read requests must be sent to the root and then down to a leaf and back again, because an intermediate-level cache cannot satisfy them.

Interestingly, the global memory has been distributed to the processors. In conjunction with the cache coherence protocol, this allows an arbitrary number of copies. Since data items have no home locations, as opposed to the Wilson and Wisconsin Multicube architectures, they will "diffuse" to those memory modules where they are needed. The Data Diffusion Machine is currently being built at the Swedish Institute of Computer Science.

Compared to the full-map directory schemes, these architectures are more cost-effective in terms of memory overhead and constitute an interesting extension to bus-based architectures for large shared-memory multiprocessors. However, it is too early to tell whether implementations will prove efficient.

Software-based schemes

Software cache-coherence schemes attempt to avoid the need for complex hardware mechanisms. Let's take a look at some of the proposals.

How to prevent inconsistent cache copies. Hardware-based protocols effectively reduce network traffic. However, we pay for this with complex hardware mechanisms, especially for multiprocessors with a large number of processors.

An alternative is to prevent the existence of inconsistent cached data by limiting the caching of a data structure to times when it is safe. This makes it necessary to analyze the program to mark variables as cacheable or noncacheable, which a sophisticated compiler or preprocessor can do. The most trivial solution would be to mark all shared read-write variables as noncacheable. This is too conservative, since shared data structures can be exclusively accessed by one process, or can be read-only, during a considerable amount of time. During such intervals it is safe to cache a data structure.

A better approach lets the compiler analyze when it is safe to cache a shared read-write variable. During such intervals it marks the variable as cacheable. At the end of the interval, main memory must be made consistent with the cached data, and the cached data must be made inaccessible by invalidation. This raises the following key questions: How does the compiler mark a variable as cacheable, and how is data invalidated? The following survey of software-based cache coherence schemes addresses these issues. Consult Cheong and Veidenbaum13 for references to further reading. See also the article written by Cheong and Veidenbaum in this issue.

Cacheability marking. We can base the reference marking of a shared variable on a static partitioning of the program into computational units. Accesses to a shared variable in one computational unit might differ from those of another computational unit. For example, the accesses might be one of the following types:

(1) Read-only for an arbitrary number of processes.
(2) Read-only for an arbitrary number of processes and read-write for exactly one process.
(3) Read-write for exactly one process.
(4) Read-write for an arbitrary number of processes.

Here, we assume that processes execute on different processors. Type 1 implies that the variable is cacheable, such as all elements of the shared matrix A and vector b in the iterative algorithm of Figure 4. Type 2 implies that at most the read-write process may cache the variable and that main memory is always kept consistent; using a write-through update policy achieves this. Type 3 allows the variable to be cached and updated using copy-back, as for the shared variables in the critical sections of the producer and consumer code in Figure 3. Finally, for type 4 we must mark the variable as noncacheable. Consider, for example, synchronization variables such as those implementing the mutexbegin, mutexend, and barrier synchronization of Figures 3 and 4.

Because synchronizations often delimit a computational unit, we can apply different rules for maintaining a variable's consistency in different computational units. Between computational units, cached shared variables must be invalidated before the next computational unit is entered. Moreover, main memory must be updated, either by using a write-through update policy or by flushing the content of the cache if a copy-back policy is used.

Computational units are easily identified if they are explicit in the program code, such as the parallel for-loops in the iterative algorithm. In the first parallel loop, all elements of xtemp are type 3, while all elements of A, b, and x are type 1. In the second parallel loop, all elements of vectors x and xtemp are type 3. This makes it possible to cache all shared variables, provided that all elements of vector x are invalidated at the end of the second parallel for-loop and that main memory is consistent at the beginning of the iteration. The critical sections associated with the bounded-buffer algorithm provide another example of a computational unit.

Typically, the compiler's main task is to analyze data dependencies and generate appropriate cache instructions to control the cacheability and invalidation of shared variables.
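The access-type classification above translates directly into a per-unit marking decision. The following C sketch makes that mapping explicit; the enums and the function name are my own, not the interface of any of the cited compilers, and the mapping simply restates the rules for types 1 through 4.

/* Sketch of cacheability marking per computational unit. */
#include <stdio.h>

enum access_type {
    READ_ONLY_MANY = 1,        /* type 1 */
    READ_ONLY_MANY_RW_ONE,     /* type 2 */
    RW_ONE,                    /* type 3 */
    RW_MANY                    /* type 4 */
};

enum marking {
    CACHEABLE_COPY_BACK,       /* cache freely, flush/copy back at the unit's end  */
    CACHEABLE_WRITE_THROUGH,   /* only the writer caches; memory kept consistent   */
    NONCACHEABLE               /* always access main memory                        */
};

enum marking mark_variable(enum access_type t)
{
    switch (t) {
    case READ_ONLY_MANY:        return CACHEABLE_COPY_BACK;
    case READ_ONLY_MANY_RW_ONE: return CACHEABLE_WRITE_THROUGH;
    case RW_ONE:                return CACHEABLE_COPY_BACK;
    case RW_MANY:
    default:                    return NONCACHEABLE;   /* e.g., synchronization variables */
    }
}

int main(void)
{
    printf("matrix A (type 1): marking %d\n", mark_variable(READ_ONLY_MANY));
    printf("count    (type 3): marking %d\n", mark_variable(RW_ONE));
    printf("lock     (type 4): marking %d\n", mark_variable(RW_MANY));
    return 0;
}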

The data dependence analysis itself, an important and sometimes complex task, lies outside the scope of this article; interested readers should consult the references in Cheong and Veidenbaum.13 Instead, we will look at different approaches to enforcing cache coherence and investigate the hardware support implied by these schemes. The first three approaches rely on parallel for-loops to classify the cacheability of shared variables. The last approach relies on critical sections as a model for accessing shared read-write data.

Cache coherence enforcement schemes. In the first approach, proposed by Cheong and Veidenbaum, all shared variables accessed within a computational unit receive equal treatment: either all or none can be cached. This scheme assumes a write-through cache, which guarantees up-to-date memory content. Moreover, it assumes three cache instructions: Cache-On, Cache-Off, and Cache-Invalidate. Cache-On turns caching on for all shared variables when all shared accesses are read-only (type 1) or exclusively accessed by one process (type 3). Cache-Off makes all shared accesses bypass the cache and go to the shared memory.

After execution of a computational unit, the Cache-Invalidate instruction invalidates the entire cache content. Invalidating the whole cache content, called indiscriminate invalidation, has the advantage of being easy to implement efficiently. However, indiscriminate invalidation is too conservative and leads to an unnecessarily high cache-miss ratio. For instance, in the iterative algorithm, caching is turned on, allowing all variables to be cached; but the invalidations needed after each barrier synchronization result in unnecessary misses for accesses to the read-only matrix A and vector b.

Selective invalidation of only those variables that can introduce inconsistency would improve this scheme, and it is important to implement selective invalidation efficiently. Cheong and Veidenbaum13 proposed a scheme with these objectives. In this scheme, shared-variable accesses within a computational unit are classified as always up to date or possibly stale. The scheme assumes three types of cache instructions to support this, namely Memory-Read, Cache-Read, and Cache-Invalidate. Memory-Read means a possibly stale cached copy, whereas Cache-Read guarantees an up-to-date cached copy. Furthermore, the scheme assumes the cache uses write-through.

Associated with each cache line is a change bit. The Cache-Invalidate instruction sets all change bits true. If a Memory-Read is issued to a cache block with its change bit set true, then the read request is passed to memory. When the requested block is loaded into the cache, the change bit is set false and subsequent accesses will hit in the cache.

To demonstrate this method, consider once again the iterative algorithm of Figure 4. Assume that the block size is one vector element and that n = N/L processors cooperate in the execution of the parallel loops, so each processor executes L iterations. Figure 11 includes comments for all read operations to shared data with the cache instructions supported by the scheme. The only sources of inconsistency are the accesses to vectors x and xtemp. This means the only references that need marking as Memory-Reads are the reads of x in the first parallel loop and the reads of xtemp in the second parallel loop. Clearly, this scheme eliminates cache misses on the accesses to vector b and matrix A. Turning on all change bits accomplishes the fast selective invalidation efficiently, in one cycle.

repeat
  par-for J := 1 to N do
  begin
    xtemp[J] := b[J];                    -- cache-read(b[J])
    for K := 1 to N do
      xtemp[J] := xtemp[J]               -- cache-read(xtemp[J])
                  + A[J,K]               -- cache-read(A[J,K])
                  * x[K];                -- memory-read(x[K])
  end;
  barrier-sync;                          -- cache-invalidate
  par-for J := 1 to N do
    x[J] := xtemp[J];                    -- memory-read(xtemp[J])
  barrier-sync;                          -- cache-invalidate
until false;

Figure 11. Example of reference marking of the iterative algorithm.

Even if this scheme reduces the number of invalidation misses, it is still conservative, because the same processor might execute the same iterations in the two parallel loops. If so, the corresponding elements of xtemp will be unnecessarily invalidated and reread from memory in the second parallel loop.

A third scheme takes advantage of this temporal locality: the timestamp-based scheme proposed by Min and Baer.14 This scheme associates a "clock" (a counter) with each data structure, such as vectors x and xtemp in the iterative algorithm. This clock is updated at the end of each computational unit (at the barrier synchronizations in the algorithm) in which the corresponding variable is updated. For example, the clock for vector xtemp is updated after the first parallel loop, and the clock for vector x is updated after the second loop. A timestamp associated with each block in the cache (for example, with each vector element) is set to the value of the corresponding clock plus one when the block is updated in the cache. A reference to a cached word is valid if its timestamp exceeds its associated clock value. Otherwise, the block must be fetched from memory.

This scheme eliminates invalidations, between two computational units, of variables that are local to a processor, because the timestamp value for these variables exceeds their clock value. The hardware support for this scheme consists of a number of clock registers and a timestamp entry for each cache line in the cache directory.

Let's compare the number of invalidation misses generated by the schemes presented so far with the corresponding number for a write-invalidate hardware scheme. Assume that n = N/L processors execute the iterations and that each processor always executes the same iterations; that is, processor i executes iterations (i-1)L+1 to iL in both parallel loops of the iterative algorithm. Assume a block size corresponding to one vector element. Table 4 shows the result.

Table 4. Comparison of the number of invalidation misses for the software-based schemes and a write-invalidate hardware-based scheme for the iterative algorithm.

         Indiscriminate    Fast Selective    Timestamp    Write-Invalidate
         Invalidation      Invalidation      Scheme       Hardware Scheme
Loop 1   N(L+1) + L        N                 N            N
Loop 2   L                 L                 0            0
Sum      N(L+1) + 2L       N + L             N            N

The table makes it clear that the last scheme results in the same number of invalidation misses as a write-invalidate hardware-based scheme. The hardware support for the different schemes differs in complexity. The indiscriminate invalidation scheme requires only a mechanism for turning caching on and off and for invalidating the cache. The fast selective invalidation scheme requires one bit per cache line (the change bit), whereas the timestamp-based scheme requires a timestamp register with several bits for each cache line and a number of clock registers (two in this algorithm).
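The timestamp validity test is simple enough to sketch directly in C. The sketch below follows the rule as stated above (timestamp set to clock plus one on a cache fill, valid while the timestamp exceeds the clock); the structure, names, and the two example data structures are my assumptions, not Min and Baer's hardware.

/* Illustrative sketch of the timestamp-based validity test. */
#include <stdio.h>
#include <stdbool.h>

struct cached_word {
    unsigned timestamp;       /* set to clock + 1 when the word enters the cache */
    int      value;
};

/* Valid if the timestamp exceeds the clock of the word's data structure. */
static bool is_valid(const struct cached_word *w, unsigned clock)
{
    return w->timestamp > clock;
}

static void fill_cache(struct cached_word *w, int value, unsigned clock)
{
    w->value = value;
    w->timestamp = clock + 1;
}

int main(void)
{
    unsigned clock_x = 0, clock_b = 0;    /* x is updated each iteration; b never */
    struct cached_word x_k, b_k;

    /* First parallel loop: this processor reads x[k] and b[k] into its cache. */
    fill_cache(&x_k, 10, clock_x);
    fill_cache(&b_k, 7, clock_b);

    /* Barrier after the second loop: x was updated, so its clock advances;
     * b is read-only, so its clock never changes and its copy stays usable. */
    clock_x++;

    printf("x[k] still valid next iteration: %s\n", is_valid(&x_k, clock_x) ? "yes" : "no");
    printf("b[k] still valid next iteration: %s\n", is_valid(&b_k, clock_b) ? "yes" : "no");
    return 0;
}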

Compared to the hardware support for a write-invalidate protocol, especially for multiprocessors with a large number of processors, software-based schemes are clearly very cost-effective. This comparison yields another important observation: software-based schemes are competitive with hardware-based schemes, at least for this simple example. In another article in this issue, Cheong and Veidenbaum present another scheme that preserves temporal locality between computational units. They also present a performance comparison between many of the schemes reviewed in this section.

To end this section, let's consider a software-based scheme proposed by Smith and reviewed elsewhere.13 It differs from the others by relying on the fact that accesses to shared variables always take place in critical sections, as in the bounded-buffer algorithm of Figure 3.

The scheme maintains consistency by selectively invalidating all shared variables associated with a critical section, as follows. All shared variables in a critical section are allocated to the same page, and a one-time identifier (OTI) is associated with each page. When a cache block is loaded into the cache, the corresponding OTI is loaded from the translation-lookaside buffer (TLB, an address translation mechanism) into the entry in the cache directory that corresponds to the cache line. For an access to hit in the cache, the stored OTI must match the OTI in the TLB. Fast invalidation of all shared variables associated with a critical section can now be done by simply changing the OTI for the corresponding page. This scheme's interesting feature is the fast selective invalidation mechanism.

Software-based schemes have not yet been used in any commercial systems. However, many of the ideas presented in this section are being tested in the experimental Cedar system at the University of Illinois at Urbana-Champaign.

Despite extensive study of the cache coherence problem, many possible research directions remain. First, most of the schemes presented here, except for snoopy cache protocols, have never been implemented; we can only evaluate them fully in real implementations. Second, since the area of parallel processing remains immature, we face a paucity of real-life applications for large multiprocessors. This makes it difficult to evaluate these ideas under realistic assumptions. Third, as we have seen, the design space for multiprocessor caches is large and involves complicated trade-offs. For example, a definite need exists for experimental research to explore performance differences between software-based and hardware-based schemes.

The advent of high-speed reduced instruction set computer (RISC) microprocessors with increased memory bandwidth requirements will put an increased burden on the memory system of future multiprocessors. Therefore, multiprocessor caches are, and will remain, a hot topic in the coming years.

Acknowledgments

I am indebted to Lars Philipson and Mats Brorsson for valuable comments on the manuscript. I also appreciate the constructive criticism of the referees. This work was supported by the Swedish National Board for Technical Development (STU) under contract numbers 85-3899 and 87-2427.

References

1. P. Stenstrom, "Reducing Contention in Shared-Memory Multiprocessors," Computer, Vol. 21, No. 11, Nov. 1988, pp. 26-37.
2. J. Archibald and J.-L. Baer, "Cache Coherence Protocols: Evaluation Using a Multiprocessor Simulation Model," ACM Trans. Computer Systems, Vol. 4, No. 4, Nov. 1986, pp. 273-298.
3. A.J. Smith, "Line (Block) Size Choice for CPU Cache Memories," IEEE Trans. Computers, Vol. C-36, No. 9, 1987, pp. 1,063-1,075.
4. S. Eggers and R. Katz, "Evaluating the Performance of Four Snooping Cache Coherency Protocols," Proc. 16th Int'l Symp. Computer Architecture, 1989, pp. 2-15.
5. L. Rudolph and Z. Segall, "Dynamic Decentralized Cache Schemes for MIMD Parallel Architectures," Proc. 11th Int'l Symp. Computer Architecture, 1984, pp. 340-347.
6. A. Karlin et al., "Competitive Snoopy Caching," Proc. 27th Ann. Symp. Foundations of Computer Science, 1986, pp. 244-254.
7. A. Agarwal et al., "An Evaluation of Directory Schemes for Cache Coherence," Proc. 15th Int'l Symp. Computer Architecture, 1988, pp. 280-289.
8. P. Stenstrom, "A Cache Consistency Protocol for Multiprocessors with Multistage Networks," Proc. 16th Int'l Symp. Computer Architecture, May 1989, pp. 407-415.
9. IEEE, Scalable Coherent Interface: IEEE P1596 SCI Coherence Protocol, Mar. 1989.
10. A.W. Wilson Jr., "Hierarchical Cache/Bus Architecture for Shared Memory Multiprocessors," Proc. 14th Int'l Symp. Computer Architecture, 1987, pp. 244-252.
11. J. Goodman and P. Woest, "The Wisconsin Multicube: A New Large-Scale Cache-Coherent Multiprocessor," Proc. 15th Int'l Symp. Computer Architecture, 1988, pp. 422-431.
12. S. Haridi and E. Hagersten, "The Cache Coherence Protocol of the Data Diffusion Machine," Proc. PARLE 89, Vol. 1, Springer-Verlag, 1989, pp. 1-18.
13. H. Cheong and A. Veidenbaum, "A Cache Coherence Scheme with Fast Selective Invalidation," Proc. 15th Int'l Symp. Computer Architecture, 1988, pp. 299-307.
14. S.L. Min and J.-L. Baer, "A Timestamp-Based Cache Coherence Scheme," Proc. 1989 Int'l Conf. Parallel Processing, 1989, pp. I-23 - I-32.

Per Stenstrom is an assistant professor of computer engineering at Lund University, where he has conducted research in parallel processing since 1984. His research interests encompass parallel architectures, especially memory systems for multiprocessors. While a visiting scientist in the Computer Science Department at Carnegie Mellon University (1987-88), he investigated performance aspects of memory systems in distributed systems. Stenstrom received an MS degree in electrical engineering in 1981 and a PhD degree in computer engineering in 1990, both from Lund University. He is a member of the IEEE and the Computer Society.

Readers may write to the author at Lund University, Dept. of Computer Engineering, PO Box 118, S-221 00 Lund, Sweden.
