Mostly Lock-Free Malloc
Dave Dice
Sun Microsystems, Inc.
Burlington, MA
[email protected]

Alex Garthwaite
Sun Microsystems Laboratories
Burlington, MA
[email protected]

ABSTRACT
Modern multithreaded applications, such as application servers and database engines, can severely stress the performance of user-level memory allocators like the ubiquitous malloc subsystem. Such allocators can prove to be a major scalability impediment for the applications that use them, particularly for applications with large numbers of threads running on high-order multiprocessor systems.

This paper introduces Multi-Processor Restartable Critical Sections, or MP-RCS. MP-RCS permits user-level threads to know precisely which processor they are executing on and then to safely manipulate CPU-specific data, such as malloc metadata, without locks or atomic instructions. MP-RCS avoids interference by using upcalls to notify user-level threads when preemption or migration has occurred. The upcall will abort and restart any interrupted critical sections.

We use MP-RCS to implement a malloc package, LFMalloc (Lock-Free Malloc). LFMalloc is scalable, has extremely low latency and excellent cache characteristics, and is memory efficient. We present data from existing benchmarks showing that LFMalloc is often 10 times faster than Hoard, another malloc replacement package.

Categories and Subject Descriptors
D.4.2 [Storage Management]: Allocation/deallocation strategies.

General Terms
Algorithms, Design

Keywords
Restartable Critical Sections, Affinity, Locality, malloc, Lock-Free Operations

Copyright 2002 Sun Microsystems, Inc. ISMM'02, June 20-21, 2002, Berlin, Germany. ACM 1-58113-539-4/02/0006.

1. Introduction and Motivation
Many C and C++ applications depend on the malloc dynamic memory allocation interface. The performance of malloc can be crucial and can often limit the application. Many malloc subsystems were written in a time when multiprocessor systems were rare, memory was scarce, and memory access times were uniform. They use memory efficiently but are highly serial and constitute an obstacle to throughput for parallel applications. Evidence of the importance of malloc comes from the wide array of after-market malloc replacement packages that are currently available.

Our goal is to provide a malloc implementation with the following characteristics:
♦ Low latency. Malloc and free operations should complete quickly.
♦ Excellent scalability. The allocator should provide high throughput even under extreme load with large numbers of processors and even larger numbers of threads.
♦ Memory efficiency. The allocator should minimize wastage.
♦ Orthogonality. The implementation should be independent of threading models, and work with large numbers of threads in a preemptive multitasking environment where threads are permitted to migrate freely between processors.
♦ Minimal interconnect traffic. Interconnect bandwidth is emerging as the limiting performance factor for a large class of commercial applications that run on multiprocessor systems. We reduce traffic by increasing locality.

In addition to the raw performance of malloc itself, the secondary effects of malloc policies take on greater importance. A good malloc implementation can reduce the translation lookaside buffer miss rate, the cache miss rate, and the amount of interconnect traffic for an application accessing malloc-ed blocks. These effects are particularly important on large server systems with deep memory hierarchies.

Malloc implementations exist in a wide spectrum. At one extreme we find a single global heap protected by one mutex. The default "libc" malloc in the Solaris™ Operating Environment is of this design. This type of allocator can organize the heap with little wastage, but operations on the heap are serialized, so the design doesn't scale. At the other extreme, a malloc allocator can provide a private heap for each thread. Malloc operations that can be satisfied from a thread's private heap don't require synchronization and have low latency. When the number of threads grows large, however, the amount of memory reserved in per-thread heaps can become unreasonable. Various solutions have been suggested, such as adaptive heap sizing or trying to minimize the number of heaps in circulation by handing off heaps between threads as they block and unblock.

In the middle, we find Hoard [3], which limits the number of heaps by associating multiple threads with a fixed number of local heaps. A thread is associated with a local heap by a simple hashing function based on the thread's ID. The number of heaps in Hoard is proportional to the number of CPUs. Since the local heaps are shared, synchronization is required for malloc and free operations. To avoid contention on local heaps, Hoard provides more local heaps than processors.

LFMalloc uses data structures and policies similar to those found in Hoard version 2.1.0, but unlike Hoard, LFMalloc needs only one heap per CPU. Malloc and free operations on shared per-CPU heaps don't usually require locking or atomic operations. Each allocating thread knows precisely which processor it is running on, resulting in improved data locality and affinity. Local heap metadata tends to stay resident in the CPU's caches, even though multiple threads will operate on the metadata.
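To make the two selection policies concrete, the following C sketch contrasts a Hoard-style heap lookup, which hashes the thread's ID onto a fixed pool of shared heaps, with a per-CPU lookup of the kind LFMalloc uses. Everything here is illustrative rather than taken from either allocator: heap_t, local_heaps, NHEAPS, and cpu_self() are invented names, and cpu_self() stands in for the MP-RCS mechanism (Section 3) that lets a thread learn which processor it is running on.

    #include <pthread.h>
    #include <stdint.h>

    #define NHEAPS 32                      /* fixed pool, proportional to CPU count */

    typedef struct heap { void *free; } heap_t;   /* placeholder heap record */
    static heap_t local_heaps[NHEAPS];

    /* Hoard-style selection: hash the thread's ID onto one of a fixed set
     * of shared local heaps.  Unrelated threads can collide on the same
     * heap, so heap operations still need locks or atomic instructions. */
    static heap_t *heap_for_thread(void) {
        uintptr_t id = (uintptr_t)pthread_self();  /* assumes integral pthread_t */
        return &local_heaps[(id >> 4) % NHEAPS];   /* shift spreads aligned values */
    }

    /* LFMalloc-style selection: index by the processor the thread is running
     * on right now, so one heap per CPU suffices.  cpu_self() stands in for
     * the MP-RCS machinery that keeps this answer honest across migration. */
    extern int cpu_self(void);                     /* hypothetical primitive */
    static heap_t *heap_for_cpu(void) {
        return &local_heaps[cpu_self()];
    }

Because the hash only approximates processor identity, Hoard must over-provision heaps to keep collisions rare; indexing by the actual CPU needs exactly one heap per processor, but is safe only if preemption and migration are detected, which is the role of MP-RCS.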
2. Related Work
Much research has been done in the area of kernel-assisted non-blocking synchronization. In [4] and [5] Bershad introduces restartable atomic sequences. Mosberger et al. [23][24] and Moran et al. [22] did related work. Johnson and Harathi describe interruptible critical sections in [14]. Takada and Sakamura describe abortable critical sections in [29]. Moss and Kohler describe restartable routines for the Trellis/Owl programming environment in [25]. Scheduler activations, in [28] and [1], make kernel scheduling decisions visible to user programs. Kontothanassis et al. discuss scheduler-conscious synchronization in [15].

A primary influence on our work was the paper by Shivers et al. [27] on atomic heap transactions and fine-grained interrupts for uniprocessors. In that work, the authors describe their efforts to develop efficient allocation services for use in an operating system written in ML, how these services are best represented as restartable transactions, and how they interact well with the kernel's need to handle interrupts efficiently. A major influence on their project was the work done at MIT on the ITS operating system. That operating system used a technique termed PCLSR [2] that ensured that, when any thread examined another thread's state, the observed state was consistent. The basic mechanism was a set of protocols to roll a thread's state either forward or backward to a consistent point at which it could be suspended.

Hudson

Other multiprocessor malloc packages include Larson and Krishnan's LKMalloc [16], Gloger's PTMalloc [9], and Vee and Hsu's allocator [31].

McKenney and Bonwick describe kernel allocators. The kernel provides a favorable environment for malloc; code executing in the kernel has direct control over preemption and migration, and can efficiently query the ID of the current processor. Other user-level allocators attempt to approximate aspects of the kernel environment. Hoard and mtmalloc, for instance, hash a thread's ID to approximate a processor ID. They also work best with a small number of user-level threads, where user-level threads approximate kernel-level CPUs.

3. Design and Implementation
LFMalloc consists of a malloc allocator built on top of the MP-RCS subsystem. The allocator uses MP-RCS to safely operate on CPU-local malloc metadata. The MP-RCS subsystem includes a kernel driver, a user-mode notification routine, and the restartable critical sections themselves. The kernel driver monitors migration and preemption, and posts upcalls to the user-mode notification routine. The notification routine (a) tracks which processor a thread is running on, and (b) aborts and recovers from any interrupted critical sections. A restartable critical section is a block of code that updates CPU-specific data. Using restartable critical sections, we have constructed a transactional layer that supports conditional load and store operations. The initial implementation of LFMalloc is on SPARC™ Solaris.
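As a rough illustration of how such a conditional-store layer can be used, the C sketch below pops a block from the current CPU's free list. The interface shown (rcs_t, rcs_begin, rcs_store) is a hypothetical rendering of a conditional store, not the actual LFMalloc API; in the real system the critical section is a short restartable instruction sequence, and the abort and restart are performed by the upcall rather than signalled through a return value.

    /* Hypothetical conditional-store interface over MP-RCS.
     * rcs_begin() records the processor the caller is running on.
     * rcs_store() performs the store only if the caller has neither been
     * preempted nor migrated since rcs_begin(); otherwise it stores
     * nothing and returns 0 so the caller can retry from the top. */
    typedef struct { int cpu; } rcs_t;
    extern int rcs_begin(rcs_t *tx);                          /* returns current CPU ID */
    extern int rcs_store(rcs_t *tx, void **addr, void *val);  /* 1 = committed */

    struct block    { struct block *next; };
    struct cpu_heap { struct block *free; };
    extern struct cpu_heap heaps[];              /* one heap per processor */

    /* Lock-free pop from the running CPU's free list.  All work before
     * the commit is harmless to repeat, so an interrupted attempt can
     * simply be restarted, possibly against a different CPU's heap. */
    static struct block *cpu_freelist_pop(void) {
        for (;;) {
            rcs_t tx;
            int cpu = rcs_begin(&tx);
            struct block *head = heaps[cpu].free;
            if (head == NULL)
                return NULL;                     /* caller takes a slower path */
            if (rcs_store(&tx, (void **)&heaps[cpu].free, head->next))
                return head;                     /* single visible side effect */
        }
    }

The shape of the loop is the essential point: reads and computation before the commit may be repeated freely, the single store that publishes the update is the commit point, and an interrupted attempt is simply re-executed, with no locks and no atomic instructions on the common path.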
3.1 MP-RCS
MP-RCS permits multiple threads running concurrently on different CPUs to operate on CPU-specific data in a consistent manner and without interference. MP-RCS doesn't presume any particular thread model: threads may be preemptively scheduled and may migrate between processors at any time.

In the general case, interference can come either from preemption or from threads running concurrently on other CPUs. Since restartable critical sections operate only on CPU-specific data, we need only be concerned with preemption. For a uniprocessor system, or for a multiprocessor system accessing CPU-specific data, interference is solely the result of kernel preemption.

Techniques to avoid interference fall into two major categories: mutual exclusion-based or lock-free [11][10][30]. Mutual exclusion is usually implemented with locks. It is pessimistic: mutual exclusion techniques prevent interference. If a thread holding a traditional lock is preempted, other threads will be denied access to the critical section, resulting in convoys, excessive context switching, and the potential for priority