Memory Coherence in Shared Virtual Systems

Memory Coherence in Shared Virtual Memory Systems 1 Kai Li and Paul Hudak Department of Computer Science Yale University New Haven, CT 06520 Abstract The shared virtual memory that we will describe not only "pages" data between physical memories and disks, This paper studies the memory coherence problem in de- as in a conventional virtual memory system, but it also signing and implementing a shared virtual memory on loosely- "pages" data between the physical memories of the individ- coupled multiprocessors. Two classes of algorithms for solv- ual processors. Thus data can naturally migrate between ing the problem are presented. A prototype shared virtual processors on demand. Furthermore, just as a conventional memory on an Apollo ring has been implemented based v irtual memory also pages processes, so does the shared vir- on these algorithms. Both theoretical and practical results tual memory. Thus our approach provides a very natural show that the memory coherence problem can indeed be and efficient form of process migration between processors solved efficiently on a loosely-coupled multiprocessor. in a distributed system, normally a very difficult feature to implement well (and in effect subsuming the notion of 1 Introduction remote procedure call). The benefits of a virtual memory go without saying, and almost every high-performance sequential computer in ex- l CPU 1 CPU 2 CPU N / istence today incorporates one. Virtual memories are so useful that it is hard to believe that parallel architectures Memory 1 Memory 2 Memory N would not also benefit from them. Indeed, one can easily imagine how virtual memory would be incorporated into a I " / / shared-memory parallel machine, since the memory hierar- \ x \ I , \ / r chy need not be much different from that of a sequential N \ / / machine. On the other hand, on a "loosely-coupled multi- \ / processor" in which the physical memory is distributed, the \ / \ VirtualShared t / implementation is not as obvious, and to our knowledge no \ Memory / such implementation exists. " I The shared virtual memory described in this paper provides a virtual address space which is shared among all Figure 1: Shared virtual memory mapping. processors in a loosely-coupled multiprocessor system, as shown graphically in Figure 1. The shared memory itself The main difficulty in building a shared virtual memory exists only virtually. Application programs can use it in is solving the memory coherence problem. This problem is the same way as a traditional virtual memory, except, of similar to that which arises with conventional caches (see course, that processes can run on different processors in [14] for a survey), but in particular with multicache schemes parallel. for shared memory multiprocessors [16,1,7,18,6,19,13]. In this paper we concentrate on the memory coherence problem for a shared virtual memory. A number of algorithms axe presented, analyzed, and compared. Several of the algorithms have been implemented on a local area network of Apollo workstations. We present experimental results on non-trivial parallel programs that demonstrate the viabil- ity of shared virtual memory even on very loosely-coupled systems such as the Apollo network. Our success suggests a 1This research was supported in part by NSF Grants MCS-8302018 and DCR-8106181. Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specfic permission. © 1986 ACM 0-89791-198-9/86/0800-0229 75¢ 229 radically different viewpoint of such architectures, in which nications overhead as long as a page can fit into a packet. one can exploit the total processing power and memory ca- In addition, such a choice allows us to use existing page- pabilities of such systems in a far more unified way than fault schemes (i.e., hardware mechanisms) that allow single the traditional "message-passing" approach. instructions to trigger page-faults and trap to appropriate fault handlers. This can be done by setting the access rights to the pages in such a way that memory accesses that could 2 Design Choices for Memory Co- violate memory coherence cause a page fault, and thus the herence memory coherence problem can be solved in a modular way in the page fault handlers. Our design goals require that the shared virtual memory Part of the justification for using page size granularity, be coherent. A memory is coherent if the value returned of course, is that memory references in sequential programs by a read operation is always the same as the value written generally have a high degree of locality [3,4]. Although by the most recent write operation to the same address. memory references in parallel programs may behave differ- Coherence can be maintained if a shared virtual memory ently from those in sequential ones, a single process remains satisfies the following single constraint: a sequential program, and should exhibit a high degree of locality. Contention among parallel processes for the same • A processor is allowed to update a piece of data only piece of data depends on the algorithm, of course, but a while no other processor is updating or reading it. common goal in designing parallel algorithms is to mini- This allows many processors to read a piece of data as long mize such contention for optimal performance. as no other processor is updating it, and is a form of the 2.2 Memory Coherence Strategies well-known readers/writers problem. There are two design choices that greatly influence the It is helpful first to consider the spectrum of strategies implementation of a shared virtual memory: the granular- one may choose from to solve the memory coherence prob- ity of the memory units, and the strategy for maintaining lem. These strategies may be classified by the way in which coherence. one deals with page synchronization and page ownership, as shown in Table 1. 2.1 Granularity Page synchronization The size of the "memory units" that are to be coherently maintained is an important consideration in a shared vir- There are two basic approaches to page synchronization: tual memory. We discuss in this section several criteria for invalidation and writeback. In the invalidation approach, choosing this granularity. if a processor has a write fault, the fault handler will copy In a typical loosely-coupled multiprocessor system, send- the true page containing the memory location, invalidate all ing large packets of data (say one thousand bytes) is not other copies of the page, change the access of the page to much more expensive than sending small ones (say less than write, and return to the faulting instruction. After return- ten bytes) [15]. This is usually due to the typical software ing, the processor "owns" that page and can proceed with protocols and overhead of the virtual memory layer of the the write operation and other read or write operations until operating system. This fact makes relatively large memory the page ownership is relinquished to some other processor. units seem feasible. In the writeback approach, if a processor has a write On the other hand, the larger the memory unit, the fault, the fault handler will write to all copies of the page, greater the chance for contention. Memory contention oc- and then return to the faulting instruction. In a sense this curs when two processors attempt to write to the same approach seems ideal in that it supports the broadest no- location (as in a shared memory system) as well as when tion of sharing (indeed it simulates a centralized shared two processors attempt to write to different locations in memory!), but note that every write to a shared page will the same memory unit. Although clever memory alloca- generate a fault on the writing processor and update all tion strategies might minimize contention by arranging con- copies. Clearly doing these updates will be very expensive, current memory accesses to locations in different memory and algorithms using writeback do not seem appropriate for units, such a strategy would lead to the inefficient use of loosely coupled multiproeessors. Thus we do not consider memory space and introduce an inconvenience to the pro- them further in this paper, as indicated in Table 1. grammer. Thus the possibility of contention pushes one toward relatively small memory units. Page ownership A suitable compromise in granularity is the typical page used in a conventional virtual memory implementation. The The ownership of a page can be handled .either statically or page sizes of today's computers vary, typically from 256 dynamically. In the static approach, a page is always owned bytes to 2k bytes. Choosing this size of a memory unit has by the same processor. This means that other processors several advantages. First, experience has shown that such are never given full write access to the page; rather they sizes are suitable with respect to contention, and by our must negotiate with the owning processor, and must gener- previous argument they should not impose undue commu- ate a write fault every time they need to update the page. 230 Page ownership strategy Page Dynamic synchronization Static Distributed manager method Centralized manager Fixed Dynamic Invalidation appropriatenot okay good good not not not not Writeback appropriate appropriate appropriate appropriate Table 1: Spectrum of solutions to the memory coherence problem. As with the writeback approach, this also is an expensive 2. The copy_set field lists all processors that have copies solution for existing loosely-coupled multiprocessors, and of the page.

Memory Coherence in Shared Virtual Systems

Parallel Computer Architecture

Reducing Cache Coherence Traffic with a NUMA-Aware Runtime Approach

Thread-Level Parallelism – Part 1

Efficient Synchronization Mechanisms for Scalable GPU Architectures

CIS 501 Computer Architecture This Unit: Shared Memory

Shared-Memory Multiprocessors Gates & Transistors

Tardis: Time Traveling Coherence Algorithm for Distributed Shared Memory

Gpgpu & Accelerators

CIS 501 Computer Architecture This Unit: Shared Memory Multiprocessors Readings Multiplying Performance

Snooping Cache Coherence I

Introduction and Cache Coherency

1 Parallel Computer Architecture Concepts