Simple but Effective Techniques for NUMA Memory Management

Simple But Effective Techniques for NUMA Memory Management William J. Boloskyl Robert P. Fitzgerald2 Michael L. Scott’ Abstract In order to maximize locality of reference, data repiica- tion and migration can be performed in hardware (with Multiprocessors with non-uniform memory access times consistent caches), in the operating system, in compilers introduce the problem of placing data near the processes or library routines, or in application-specific user code. that use them, in order to improve performance. We The last option can place an unacceptable burden on have implemented an automatic page placement strat- the programmer and the first, if feasible at all in large egy in the Mach operating system on the IBM ACE machines, will certainly be expensive. multiprocessor workstation. Our experience indicates We have developed a simple mechanism to automat- that even very simple automatic strategies can produce ically assign pages of virtual memory to appropriately nearly optimal page placement. It also suggests that the located physical memory. By managing locality in the greatest leverage for further performance improvement operating system, we hide the details of specific mem- lies in reducing false sharing, which occurs when the ory architectures, so that programs are more portable. same page contains objects that would best be placed We also address the locality needs of the entire applica- in different memories. tion mix, a task that cannot be accomplished through independent modification of individual applications. Fi- nally, we provide a migration path for application de- 1 Introduction velopment. Correct parallel programs will run on our system without modification. If better performance is Shared-memory multiprocessors are attractive machines desired, they can then be modified to better exploit au- for parallel computing. They not only support a model tomatic page placement, by placing into separate pages of parallelism based on the familiar von Neumann data that are private to a process, data that are shared paradigm, they also allow processes to interact effi- for reading only, and data that are writably shared. This ciently at a very fine level of granularity. Simple physics segregation can be performed by the applications pro- dictates, however, that memory cannot simultaneously grammer on an ad hoc basis or, potentially, by special be located very close to a very large number of proces- language-processor based tools. sors. Memory that can be accessed quickly by one node We have implemented our page placement mechanism of a large multiprocessor will be distant from many other in the Mach operating system[l] on a small-scale NUMA nodes. Even on a small machine, price/performance machine, the IBM ACE multiprocessor workstation[9]. may be maximized by an architecture with non-uniform We assume the existence of faster memory that is local memory access times. to a processor and slower memory that is global to all On any Non-Uniform Memory Access (NUMA) ma- processors, and believe that our techniques will gener- chine, performance depends heavily on the extent to alize to any machine that fits this general model. which data reside close to the processes that use them. Our strategy for page placement is simple, and was embedded in the machine-dependent portions of the 1 Department of Computer Science, Mach memory management system with only two man- University of Rochester, Rochester, NY 14627. internet: boloskyOcs.rochester.edu, scott&s.rochester.edu months of effort and 1500 lines of code. We use local 21BM Thomas J. Watson Research Center, memory as a cache over global, managing consistency PO Box 218, Yorktown Heights, NY 10598-0218. with a directory-based ownership protocol similar to internet: [email protected] that used by Li[15] for distributed shared virtual mem- Permission to copy without fee all or part of this material is granted provided ory. Briefly put, we replicate read-only pages on the pro- that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, cessors that read them and move written pages to the and notice is given that copying is by permission of the Association for processors that write them, permanently placing a page Computing Machinery. To copy otherwise, or to republish, requires a fee in global memory if it becomes clear that it is being writ- and/or specific permission. 0 1989 ACM 08979L338-3/89/0012/0019 $1.50 ten routinely by more than one processor. Specifically, 19 we assume when a program begins executing that every which removes a single physical page from all pmaps in page is cacheable, and may be placed in local memory. which it is resident. Other operations create and de- We declare that a page is noncacheable when the con- stroy pmaps, fill pages with zeros, copy pages, etc. The sistency protocol has moved it between processors (in protection provided to the pmap-enter operation is not response to writes) more than some small fixed number necessarily the same as that seen by the user; Mach may of times. All processors then access the page directly in reduce privileges to implement copy-on-write or as part global memory. of the external paging system[22]. We believe that simple techniques can yield most of A pmap is a cache of the mappings for an address the locality improvements that can be obtained by an space. The pmap manager may drop a mapping or re- operating system. Our results, though limited to a sin- duce its permissions, e.g. from writable to read-only, at gle machine, a single operating system, and .a modest almost any time. This may cause a page fault, which number of applications, support this intuition. We have will be resolved by the machine-independent VM code achieved good performance on several sample applica, resulting in another pmap-enter of the mapping. This tions, and have observed behavior in others that sug- feature had already been used on the IBM RT/PC, gests that no operating system strategy will olbtain sig- whose memory management hardware only allows a sin- nificantly better results without also making language- gle virtual address for a physical page. We use it on the processor or application-level improvements in the way ACE to drive our consistency protocol for pages cached that data are grouped onto pages. in local memory. We describe our implementation in Section 2. We Mappings can be dropped, or permissions reduced, include a brief overview of the Mach memory manage- subject to two constraints. First, to ensure forward ment system and the ACE architecture. We present progress, a mapping and its permissions must persist performance measurements and analysis in Section 3. long enough for the instruction that faulted to complete. Section 4 discusses our experience and what we think Second, to ensure that the kernel works correctly, some it means. Section 5 suggests several areas for .future re- mappings must be permanent. For example, the kernel search. We conclude with a summary of what we have must never suffer a page fault on the code that handles learned about managing NUMA memory. page faults. This second constraint limits our ability to use automatic NUMA page placement for kernel memory. 2 Automatic Page Placement Mach views physical memory as a fixed-size pool of in a Two-Level Memory pages. It treats the physical page pool as if it were real memory with uniform memory access times. It is un- derstood that in more sophisticated systems these “ma- 2.1 The Mach Virtual Memory chine independent physical pages” may represent more System complex structures, such as pages in a NUMA mem- Perhaps the most important novel idea in Mach is that ory or pages that have been replicated. Unfortunately, of machine-independent virtual memory[l7]. The bulk there is currently no provision for changing the size of of the Mach VM code is machine-independent and is the page pool dynamically, so the maximum amount of supported by a small machine-dependent component, memory that can be used for page replication must be called the pmap layer, that manages address translation fixed at boot time. hardware. The pmap interface separates the machine- dependent and machine-independent parts of the VM 2.2 The IBM ACE Multiprocessor system. Workstation A Mach pmap (physical map) is an abstract ob- ject that holds virtual to physical address translations, The ACE Multiprocessor Workstation[S] is a NUMA called mappings, for the resident pages of a single virtual machine built at the IBM T. J. Watson Research Center. address space, which Mach calls a task. The pmap in- Each ACE consists of a set of processor modules and terface consists of such pmap operations as pmap-enter, global memories connected by a custom global memory which takes a pmap, virtual address, physicarl address bus (see Figure 1). Each ACE processor module has and protection and maps the virtual addre,ss to the a ROMP-C processor[l2], Rosetta-C memory manage- physical address with the given protection in the given ment unit and 8Mb of local memory. Every processor pmap; pmap-protect, which sets the protection on all can address any memory, with non-local requests sent resident pages in a given virtual address range within a over a 32-bit wide, 80 Mbyte/set Inter-Processor Com- pmap; pmap,remove, which removes all mappings in a munication (IPC) bus designed to support 16 processors virtual address range in a pmap; and pmap-remove-all, and 256 Mbytes of global memory. 20 1’ - ---- 1 d Pmap manager ++ NUMA manager 1 r t Figure 1: ACE Memory Architecture mmu interface NUMA policy Packaging restrictions prevent ACES from supporting Figure 2: ACE pmap Layer the full complement of memory and processors permit- ted by the IPC bus.

Load more