

Mostly Lock-Free Malloc

Dave Dice, Sun Microsystems, Inc., Burlington, MA ([email protected])
Alex Garthwaite, Sun Microsystems Laboratories, Burlington, MA ([email protected])

ABSTRACT

Modern multithreaded applications, such as application servers and database engines, can severely stress the performance of user-level memory allocators like the ubiquitous malloc subsystem. Such allocators can prove to be a major impediment for the applications that use them, particularly for applications with large numbers of threads running on high-order multiprocessor systems.

This paper introduces Multi-Processor Restartable Critical Sections, or MP-RCS. MP-RCS permits user-level threads to know precisely which processor they are executing on and then to safely manipulate CPU-specific data, such as malloc metadata, without locks or atomic instructions. MP-RCS avoids interference by using upcalls to notify user-level threads when preemption or migration has occurred. The upcall will abort and restart any interrupted critical sections.

We use MP-RCS to implement a malloc package, LFMalloc (Lock-Free Malloc). LFMalloc is scalable, has extremely low latency, excellent cache characteristics, and is memory efficient. We present data from some existing benchmarks showing that LFMalloc is often 10 times faster than Hoard, another malloc replacement package.

Categories and Subject Descriptors
D.4.2 [Storage Management]: Allocation/deallocation strategies.

General Terms
Algorithms, Design

Keywords
Restartable Critical Sections, Affinity, Locality, malloc, Lock-Free Operations

1. Introduction and Motivation

Many C and C++ applications depend on the malloc dynamic memory allocation interface. The performance of malloc can be crucial and can often limit the application. Many malloc subsystems were written in a time when multiprocessor systems were rare, memory was scarce and memory access times were regular. They use memory efficiently but are highly serial and constitute an obstacle to throughput for parallel applications. Evidence of the importance of malloc comes from the wide array of after-market malloc replacement packages that are currently available.

Our goal is to provide a malloc implementation with the following characteristics:

♦ Low latency. Malloc and free operations should complete quickly.

♦ Excellent scalability. The allocator should provide high throughput even under extreme load with large numbers of processors and even larger numbers of threads.

♦ Memory-efficiency. The allocator should minimize wastage.

♦ Orthogonality. The implementation should be independent of threading models, and work with large numbers of threads in a preemptive multitasking environment, where threads are permitted to migrate freely between processors.

♦ Minimal interconnect traffic. Interconnect bandwidth is emerging as the limiting performance factor for a large class of commercial applications that run on multiprocessor systems. We reduce traffic by increasing locality.

In addition to the raw performance of malloc itself, the secondary effects of malloc policies take on greater importance. A good malloc implementation can reduce the translation lookaside buffer miss rate, the cache miss rate and the amount of interconnect traffic for an application accessing malloc-ed blocks. These effects are particularly important on large server systems with deep memory hierarchies.

Malloc implementations exist in a wide spectrum. At one extreme we find a single global heap protected by one mutex. The default "libc" malloc in the Solaris™ Operating Environment is of this design. This type of allocator can organize the heap with little wastage, but operations on the heap

are serialized so the design doesn't scale. At the other extreme a malloc allocator can provide a private heap for each thread. Malloc operations that can be satisfied from a thread's private heap don't require synchronization and have low latency.

Copyright 2002 Sun Microsystems, Inc. ISMM'02, June 20-21, 2002, Berlin, Germany. ACM 1-58113-539-4/02/0006.


When the number of threads grows large, however, the amount of memory reserved in per-thread heaps can become unreasonable. Various solutions have been suggested, such as adaptive heap sizing or trying to minimize the number of heaps in circulation by handing off heaps between threads as they block and unblock.

In the middle, we find Hoard [3], which limits the number of heaps by associating multiple threads with a fixed number of local heaps. A thread is associated with a local heap by a simple hashing function based on the thread's ID. The number of heaps in Hoard is proportional to the number of CPUs. Since the local heaps are shared, synchronization is required for malloc and free operations. To avoid contention on local heaps, Hoard provides more local heaps than processors.

LFMalloc uses data structures and policies similar to those found in Hoard version 2.1.0, but unlike Hoard, LFMalloc needs only one heap per CPU. Malloc and free operations on shared per-CPU heaps don't usually require locking or atomic operations. Each allocating thread knows precisely which processor it is running on, resulting in improved data locality and affinity. Local heap metadata tends to stay resident in the CPU, even though multiple threads will operate on the metadata.

2. Related Work

Much research has been done in the area of kernel-assisted non-blocking synchronization. In [4] and [5] Bershad introduces restartable atomic sequences. Mosberger et al. [23] [24] and Moran et al. [22] did related work. Johnson and Harathi describe interruptible critical sections in [14]. Takada and Sakamura describe abortable critical sections in [29]. Moss and Kohler describe restartable routines for the trellis/owl programming environment in [25]. Scheduler activations, described in [28] and [1], make kernel scheduling decisions visible to user programs. Kontothanassis et al. discuss scheduler-conscious synchronization in [15].

A primary influence on our work was the paper by Shivers et al. [27] on atomic heap transactions and fine-grained interrupts for uniprocessors. In that work, the authors describe their efforts to develop efficient allocation services for use in an operating system written in ML, how these services are best represented as restartable transactions, and how they interact well with the kernel's need to handle interrupts efficiently. A major influence on their project is the work done at MIT on the ITS operating system. That system used a technique termed PCLSR [2] that ensured that, when any thread has examined another thread's state, the observed state was consistent. The basic mechanism was a set of protocols to roll a thread's state either forward or backward to a consistent point at which it could be suspended.

Hudson et al. [12] propose clever allocation sequences that use the IA64's predicate registers and predicated instructions to annul interrupted critical sections. Similarly, the functional recovery routines [13] and related mechanisms of the OS/390 MVS operating system allow sophisticated transactions of this type.

Finally, a number of recent papers emphasize allocators that are friendly to multiprocessor systems. These include Berger's Hoard [3], Bonwick's Vmem allocator [7], McKenney's kernel memory allocator [20], Nakhimovsky's mtmalloc [26], Larson and Krishnan's LKMalloc [16], Gloger's PTMalloc [9], and Vee and Hsu's allocator [31].

McKenney and Bonwick describe kernel allocators. The kernel provides a favorable environment for malloc; code executing in the kernel has direct control over preemption and migration, and can efficiently query the ID of the current processor. Other user-level allocators attempt to approximate aspects of the kernel environment. Hoard and mtmalloc, for instance, hash a thread's ID to approximate a processor ID. They also work best with a small number of user-level threads – user-level threads approximate kernel-level CPUs.

3. Design and Implementation

LFMalloc consists of a malloc allocator built on top of the MP-RCS subsystem. The allocator uses MP-RCS to safely operate on CPU-local malloc metadata. The MP-RCS subsystem includes a kernel driver, a user-mode notification routine, and the restartable critical sections themselves. The kernel driver monitors migration and preemption, and posts upcalls to the user-mode notification routine. The notification routine (a) tracks which processor a thread is running on, and (b) aborts and recovers from any interrupted critical sections. A restartable critical section is a block of code that updates CPU-specific data. Using restartable critical sections we have constructed a transactional layer that supports conditional load and store operations. The initial implementation of LFMalloc is on SPARC™ Solaris.

3.1 MP-RCS

MP-RCS permits multiple threads running concurrently on different CPUs to operate on CPU-specific data in a consistent manner and without interference. MP-RCS doesn't presume any particular thread model - threads may be preemptively scheduled and may migrate between processors at any time.

In the general case, interference can come from either preemption or from threads running concurrently on other CPUs. Since restartable critical sections operate only on CPU-specific data, we need only be concerned with preemption. For a uniprocessor system or for a multiprocessor system accessing CPU-specific data, interference is solely the result of kernel preemption.

Techniques to avoid interference fall into two major categories: mutual-exclusion-based or lock-free [11][10][30]. Mutual exclusion is usually implemented with locks. It is pessimistic – mutual exclusion techniques prevent interference. If a thread holding a traditional lock is preempted, other threads will be denied access to the critical section, resulting in convoys, excessive context switching, and the potential for priority inversion.

Lock-free operations using compare-and-swap (CAS), load-locked/store-conditional (LL-SC) and MP-RCS are optimistic – they detect and recover from interference. A typical lock-free transaction fetches data, computes a provisional value and attempts to commit the update. If the update fails, the program must retry the transaction.
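As an illustration of this fetch/compute/commit pattern (and not of MP-RCS itself), a minimal CAS-based update in C might look like the sketch below; the counter and its type are hypothetical and are not taken from the paper.

  #include <stdatomic.h>

  static _Atomic long counter;            /* hypothetical shared datum */

  void lockfree_increment(void)
  {
      long old, val;
      do {
          old = atomic_load(&counter);    /* fetch                       */
          val = old + 1;                  /* compute a provisional value */
          /* attempt to commit; if another thread interfered, retry      */
      } while (!atomic_compare_exchange_weak(&counter, &old, val));
  }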
Unlike locking techniques, lock-free operations behave gracefully under preemption. If a thread is interrupted in a critical section, other threads are permitted to enter and pass through the critical section. When the interrupted thread is eventually rescheduled it will restart the critical section.


Since the thread is at the beginning of a full quantum, it will normally be able to complete the critical section without further quantum-based preemption. Lock-free operations are also immune to deadlock. Note that the term critical section is usually applied to mutual exclusion, but in this paper we mean it in the more general sense of atomic sequence.

Under MP-RCS, if a thread executing in a restartable critical section either (a) migrates to another CPU, or (b) is preempted, and other related threads run on that processor, then the interrupted operation may be "stale" and must not be permitted to resume. To avoid interference, the kernel driver posts an upcall to a user-mode recovery routine. The upcall, which runs the next time the thread returns to user-mode, aborts the interrupted operation and restarts the critical section by transferring control to the first instruction in the critical section. A thread that has been preempted and may hold stale state is not permitted to commit.

Figure 3 describes the control flow used by restartable critical sections. Table 1 provides a detailed example of the actions that take place when a critical section is interrupted.

3.1.1 Restartable Critical Sections

A restartable critical section is simply a block of code. The start, end, and restart addresses of all critical sections are known to the notification routine.

A restartable critical section proceeds without locks, memory barriers, or atomic instructions, using simple load and store instructions. Restartable critical sections take the following form: (a) fetch the processor ID from thread-local storage and locate CPU-specific data, (b) optionally fetch one or more CPU-specific variables and compute a provisional value, and (c) attempt to store that new value into a CPU-specific variable. As seen in Figure 3, restartable critical sections always end with a store instruction – we say the store commits the transaction.
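The C sketch below renders the canonical critical section of Figure 3, popping a node from a CPU-specific free list. It is illustrative only: in LFMalloc the sequence is written in assembler so that its start, end and restart addresses can be registered with the notification routine, and the type and field names here are our assumptions.

  struct node     { struct node *Next; };
  struct cpudata  { struct node *Head; };
  struct selfdata { struct cpudata *Cpu; };   /* thread-local storage */

  struct node *rcs_pop(struct selfdata *Self)
  {
      /* ---- restartable critical section begins (restart point) ---- */
      struct cpudata *C = Self->Cpu;      /* 1: LD Self->Cpu, C          */
      struct node    *H = C->Head;        /* 2: LD C->Head, H            */
      if (H == 0)
          return 0;                       /* empty list: nothing to pop  */
      struct node    *N = H->Next;        /* 3: LD H->Next, N            */
      C->Head = N;                        /* 4: ST N, C->Head (commits)  */
      /* ---- end of critical section ---- */
      return H;
  }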
3.1.2 Kernel Driver

The driver makes kernel scheduling decisions, which are normally hidden, visible to user-mode threads. The driver doesn't change scheduling policies or decisions in any manner but instead informs the affected threads. It doesn't change which threads will run or when they will run but only where they will resume: in a notification routine or at the original interrupted instruction.

Referring to Figure 3, upcalls are delivered at the ONPROC transition. ONPROC is a Solaris term for "on processor" – it is the thread state transition between ready and running. Solaris scheduling states are described in detail in [18].

Running the notification routine at the OFFPROC "off processor" transition seems more intuitive, but would violate Solaris scheduling invariants and introduce subtle complications. The OFFPROC notification routine might, for example, incur a page fault and need to block (go OFFPROC) to await the page. In addition, the OFFPROC callback can't provide a "fresh" processor ID to the thread. Providing the ability to determine the ID of the current processor and to defer or disable preemption while in a critical section would seem to offer the same benefits as MP-RCS but in a more straightforward manner. Such a scheme, however, suffers from a similar "page fault" problem – if preemption is disabled and the critical section generates a page fault then the kernel would need to either spin, awaiting the page, or preempt the thread. Both OFFPROC notification and user-mode preemption control alter the kernel's scheduling decisions, where MP-RCS does not.

The kernel driver posts an upcall by saving the interrupted user-mode instruction pointer and substituting the address of the user-mode upcall routine. The upcall is deferred - the next time the thread returns to user-mode it will resume in the upcall routine instead of at the interrupted instruction.

A precondition for interference is that a thread must be preempted while executing in a critical section. The kernel driver checks for preemption and, if needed, posts an upcall. The user-mode notification routine checks if the preempted thread was executing in a critical section. For example, say that thread T is executing on a CPU. T is preempted and the kernel picks a related thread, S, to run on that CPU. S runs and then blocks. The kernel then picks T to run on the CPU. The kernel driver will post an upcall to T. The upcall is precautionary – the driver sends the upcall even if S never executed in a critical section and even if T was not interrupted while in a critical section. The notification routine checks to see if T was executing in a critical section, and, if necessary, restarts the critical section. Interference will always result in an upcall, but not all upcalls indicate true interference. The current kernel driver is unable to determine if a preempted thread was executing in a critical section – that duty lies with the notification routine. By partitioning responsibility – the driver checks for preemption and the notification routine checks for interrupted critical sections – we have minimized the amount of logic in the kernel driver, but our design incurs additional upcalls.

The kernel driver uses a set of preexisting context-switch hooks present in Solaris. These hooks are used to load and unload performance counters and hardware device state that might be specific to a thread. The driver merges multiple pending upcalls, which can result when a thread is preempted and then migrates. The kernel driver passes the current processor ID and the original interrupted instruction pointer (IP) as arguments into the notification routine. The notification routine will examine the interrupted IP to see if it was within a restartable critical section.

As an optimization the kernel driver can sometimes forgo preemption notification. When a thread that was previously preempted resumes on the same CPU, the kernel driver can skip notification if no "related" threads have run on the CPU in the interim. We define threads as related if they have the potential to access common CPU-specific data. When unrelated threads alternate execution on the same CPU there is no risk of interference. In addition, the kernel driver does not deliver upcalls to threads that have blocked voluntarily. This reduces the upcall rate but means that blocking system calls can not currently be used within restartable critical sections. (We may relax this restriction in the future.)

3.1.3 User-mode Notification Routine

The user-mode notification routine aborts and restarts interrupted critical sections as well as tracks which CPU a thread is currently executing on. Critically, the notification routine executes before control could pass back into an interrupted critical section. This gives the routine the opportunity to vet and, if needed, to abort the critical section.
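The core of the notification routine can be pictured with the following C sketch, which is only a reconstruction from the description above: the table of registered critical sections, the field names, and the calling convention are assumptions, and the real routine also handles critical sections interrupted by signals.

  #include <stdint.h>

  struct rcs_desc { uintptr_t start, end, restart; };  /* registered critical sections */
  struct tls      { int processorID; int interference; };

  extern struct rcs_desc rcs_table[];   /* hypothetical registry of RCS addresses */
  extern int             rcs_count;

  /* Returns the user-mode address at which the thread should resume. */
  uintptr_t notify(struct tls *self, int cpu_id, uintptr_t interrupted_ip)
  {
      self->processorID  = cpu_id;      /* refresh the precise processor ID        */
      self->interference = 1;           /* mark any in-flight transaction as stale */
      for (int i = 0; i < rcs_count; i++)
          if (interrupted_ip >= rcs_table[i].start &&
              interrupted_ip <  rcs_table[i].end)
              return rcs_table[i].restart;   /* interrupted in an RCS: restart it  */
      return interrupted_ip;                 /* otherwise resume where interrupted */
  }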


The kernel passes the current processor ID to the notification routine, which stores the value in thread-specific storage. Subsequent restartable critical sections fetch the processor ID from thread-specific storage and use it to locate the appropriate CPU-specific data, such as malloc metadata. Upcalls give the thread a precise processor ID – used inside a restartable critical section, the value will never be stale.

We originally implemented upcalls with UNIX signals, but signals had specific side-effects, such as expedited return from blocking system calls, that proved troublesome. Still, upcalls are similar in spirit to signals.

Finally, to avoid nested upcalls, MP-RCS inhibits notification while an upcall is executing. The upcall hand-shakes with the kernel to indicate that it is finished. Also, upcalls have precedence over asynchronous signals.

3.1.4 Transactional Layer

In our current implementation restartable critical sections must be written in assembler using a non-destructive idiom, where none of the input arguments can be destroyed until after the commit point. To make it easier to employ MP-RCS we have developed a small transactional layer written in assembler, but callable from C. The layer is built on top of small restartable critical sections. It supports starting a transaction, conditional loads, and conditional stores. Each thread has a private "interference" flag. Starting a transaction clears the flag. The load and store operators, protected by restartable critical sections, test the flag, and then execute or annul the memory operation accordingly. The notification routine sets the flag, indicating potential interference. Conditional loads can be used to protect against loading from stale pointers.
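A C-level sketch of that interface, under assumed names, appears below. It is only meant to convey the semantics: in LFMalloc the conditional load and store operators are themselves tiny restartable critical sections written in assembler, so the flag test and the memory operation cannot be separated by an upcall the way they could be in plain C.

  struct txn_tls { volatile int interference; };   /* per-thread flag */

  void txn_start(struct txn_tls *self)
  {
      self->interference = 0;            /* cleared at the start of a transaction  */
  }

  /* Conditional load: annulled if interference has been detected. */
  int txn_load(struct txn_tls *self, void **addr, void **value)
  {
      if (self->interference)
          return 0;                      /* possible stale pointer: annul the load */
      *value = *addr;
      return 1;
  }

  /* Conditional store: commits only if the transaction was undisturbed. */
  int txn_store(struct txn_tls *self, void **addr, void *value)
  {
      if (self->interference)
          return 0;                      /* notification routine set the flag */
      *addr = value;
      return 1;
  }

A caller would typically retry the whole transaction whenever a conditional operation reports failure.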
3.1.5 Discussion

Upcalls and restartable critical sections are time-efficient. On a 400 MHz UltraSPARC-II system the latency of an upcall is 13 µsecs. Restarting a critical section entails reexecution of a portion of the critical section. These are wasted cycles, but in practice critical sections are short and upcalls are infrequent. An important factor in the upcall rate is the length of a quantum. If the duration of a restartable critical section approaches the length of a quantum, the critical section will fail to make progress. This same restriction applies to some LL-SC implementations, like the one provided by the DEC Alpha.

MP-RCS confers other benefits. Restartable critical sections are not vulnerable to the A-B-A problem [30] associated with CAS. More subtly, reachable objects are known to be live. Take the example of a linked list with concurrent read and delete operations. We need to ensure that a reader traversing the list doesn't encounter a deleted node. The traditional way to prevent such a race is to lock nodes – such locks are usually stored with the pointer to the object, not the object itself. A restartable critical section can traverse such a list without any locking. If a reader is preempted by a delete operation the reader will restart and never encounter any deleted nodes. The Read-Copy Update protocol [19][8] describes another approach to the problem of existence guarantees.

MP-RCS does not impose any particular thread model; it works equally well with a 1:1 model or a multi-level model where many "logical" threads are multiplexed on fewer "real" threads. In addition, MP-RCS operates independently of the relationship between threads and processors; under MP-RCS threads may be bound to specific processors or allowed to migrate freely. Finally, developing MP-RCS reinforced the lesson that fast thread-local storage accessors, as also noted in [6], are crucial to performance.

3.2 Malloc and Free

This section briefly describes the Hoard allocator, contrasts Hoard with the LFMalloc allocator and describes the operation of malloc and free in LFMalloc.

3.2.1 Hoard

Hoard uses segregated free lists with size-classes [32]. Sizes in the range (48,56] bytes, for instance, are lumped together and considered one size-class. A malloc request for 50 bytes would be quantized into the size-class (48,56] and a 56-byte block would be allocated.

Hoard has a single global heap and multiple local heaps. Each heap contains an array of pointers, indexed by size-class, to groups of superblocks. All superblocks in that group contain blocks of the given size-class. A superblock consists of a header and an array of contiguous blocks of a single size-class. The superblock header contains a LIFO list of free blocks within the superblock. A block contains a pointer to the enclosing superblock, alignment padding, and the data area. Free blocks contain pointers to the next free block – forming the free list. The malloc operator allocates a block from a free list and returns the address of the data area. A superblock is associated with at most one heap at any given time. We say that heap "owns" the superblock.

All superblocks are the same overall length – a multiple of the system page size – and are allocated from the system. All blocks in a superblock are in the same size-class. The maximum step between adjacent size-classes is 20%, constraining internal fragmentation to no more than 20%. Hoard handles large requests by constructing a special superblock that contains just one block – this maintains the invariant that all blocks must be contained within a superblock. Superblocks are size-stable. They remain of a fixed size-class until the last in-use block is released. At that point, the superblock can be "reformatted" to support other size-classes.

3.2.2 Comparing LFMalloc and Hoard

LFMalloc shares all the previous characteristics with Hoard but differs in the following:

1. In LFMalloc a local heap may own at most one superblock of a given size-class while the global heap may own multiple superblocks of a given size-class. Hoard permits both global and local heaps to own multiple superblocks of a size-class. Hoard will move superblocks from a local heap to the global heap if the proportion of free blocks to in-use blocks in the local heap passes the emptiness threshold. Because LFMalloc limits the number of superblocks of a given size to one, it does not need to take special measures to prevent excessive blocks from accumulating in a local heap.

2. LFMalloc requires only P local heaps (P is the number of processors), where Hoard must provide extra heaps to avoid heap lock contention. Hoard works best when the number of threads is less than the number of local heaps.


3. In Hoard, a set of threads, determined by a hash function applied to the thread ID, will share a local heap. During its tenure a thread will use only one local heap. Hoard has the potential for a class of threads, running concurrently on different processors, to map to a single "hot" local heap. In LFMalloc, threads are associated with local heaps based on which processor they are executing on. Concurrent malloc operations are evenly distributed over the set of local heaps. The association between local heaps and CPUs is fixed in LFMalloc while in Hoard the association varies as threads migrate between CPUs.

4. Since LFMalloc has fewer superblocks in circulation, we increased the superblock size from 8192 to 65536 bytes without appreciably affecting the peak heap utilization.

5. In LFMalloc, each superblock has two free lists: a local free list which is operated on by restartable critical sections and a remote free list protected by a traditional system mutex. The two lists are disjoint. Malloc and free operations on a local heap manipulate the local free list without locks or atomic instructions. A thread allocates blocks from its local heap but can free a block belonging to any heap. Blocks being freed in a non-local heap are added to the remote free list. In LFMalloc, traditional locking is used only for operations on remote free lists and for operations on the global heap.

6. When a superblock becomes entirely free LFMalloc can return the superblock to the operating system or reformat it. Hoard never returns superblocks to the operating system.

3.2.3 LFMalloc: Malloc and Free Algorithms

In LFMalloc a superblock is either online or offline. An online superblock is associated with a processor – it belongs to a local heap. An offline superblock is in the global heap. We say a block is local to a thread if the enclosing superblock belongs to a local heap associated with the processor on which the thread is running. A local heap and the local free lists of the superblocks it owns are CPU-specific data. A local heap is permanently fixed to its processor. A superblock's local free list is CPU-specific while the superblock remains online. Figure 1, below, depicts the data structures used by LFMalloc.

LFMalloc uses restartable critical sections to operate on local free lists and to attach and detach superblocks from local heaps.

Pseudo-code for malloc(sz):

    // RCS {...} delineates a restartable critical section
    if sz is large,
        construct and return a large superblock
    sc ← SizeClassRadixMap[sz/8]        // compute size-class in constant time
    self ← thread-local storage block for the current thread
    Retry:
    RCS {                               // try to allocate locally
        H ← LocalHeap[self.processorID]
        S ← H[sc]
        b ← pop a block from S.LocalFreeList
    }
    if b != null, return b              // fast path
    // Superblock S is empty – replace it
    search the global heap for a non-full superblock R of size-class sc
    if none found,
        construct a new superblock R
    else
        lock R
        merge R.RemoteFreeList into R.LocalFreeList
        unlock R
    // R goes online, S goes offline
    try to install R in H[sc], replacing S
    if that failed,
        move R to the global heap
    else
        move S to the global heap
    goto Retry

Following the pseudo-code above, malloc finds the local heap H associated with the current processor using the processor ID supplied by the notification routine. Using the local heap, malloc locates the superblock S of the appropriate size-class and attempts to unlink a block b from the local free list. If no superblock of the size-class exists, or if the superblock's local free list is empty, malloc will find or construct a new superblock and then attempt to install it in the local heap. Finally, the previous superblock, if it exists, is returned to the global heap – it transitions offline.

Pseudo-code for free(b):

    if the block is large,
        destroy the superblock and return
    S ← superblock containing b
    RCS {
        if block b is not local, goto Slow
        add b to S.LocalFreeList
    }
    return                              // fast path
    Slow:
    lock S                              // S is remote or offline
    add b to S.RemoteFreeList
    unlock S
    if the number of in-use blocks in S reaches 0 and the number of
        entirely free superblocks in the global heap exceeds a limit,
        destroy the superblock S
    return

To free a local block, the free operator pushes the block onto the superblock's local free list. To free a non-local block, the free operator must lock the superblock, add the block to the remote free list, and then unlock the superblock.
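To make the pseudo-code and Figure 1 (below) concrete, the following C sketch shows one plausible shape for the superblock and local-heap structures and for the malloc fast path. It is our reconstruction, not LFMalloc's source: the type and field names, the size-class bound, and the use of comments to stand in for the registered assembler critical section are all assumptions.

  #include <pthread.h>

  #define NSIZECLASSES 64                      /* assumed bound on size-classes */

  struct superblock;
  struct localheap;

  struct block {
      struct superblock *owner;                /* pointer to enclosing superblock */
      struct block      *next;                 /* links free blocks (LIFO)        */
      /* ... data area follows ... */
  };

  struct superblock {
      int                sizeclass;
      int                inuse;                /* blocks currently allocated            */
      struct block      *localFreeList;        /* touched only inside an RCS            */
      pthread_mutex_t    remoteLock;           /* protects the remote free list         */
      struct block      *remoteFreeList;       /* blocks freed from non-local CPUs      */
      struct localheap  *heap;                 /* owning local heap, or the global heap */
  };

  struct localheap {                           /* one per CPU, permanently fixed to it  */
      struct superblock *sb[NSIZECLASSES];     /* at most one superblock per size-class */
  };

  extern struct localheap LocalHeap[];         /* indexed by processor ID */

  /* Fast path of malloc: pop a block from the local free list of the
   * superblock that the current CPU's heap owns for this size-class.
   * Returns null when the slow path (global heap) must be taken.     */
  struct block *malloc_fast_path(int processorID, int sc)
  {
      /* ---- restartable critical section ---- */
      struct localheap  *H = &LocalHeap[processorID];
      struct superblock *S = H->sb[sc];
      struct block      *b = (S != 0) ? S->localFreeList : 0;
      if (b != 0)
          S->localFreeList = b->next;          /* the store commits the pop */
      /* ---- end of critical section ---- */
      return b;
  }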


[Figure 1. Data Structures used by LFMalloc. The figure shows two CPUs, each with a local heap containing a SizeClass[] array; threads (running or blocked) reach the local heap of the CPU they are running on through thread-local storage. Local heaps point to online superblocks, while the global heap, with its own SizeClass[] array, holds offline superblocks. Shaded blocks are on a superblock's free list.]

A malloc request that can be satisfied from the local heap is lock-free. Likewise, freeing a local block is lock-free.

Local heap metadata is accessed exclusively by one processor. The association between processors and local heaps remains fixed even as various threads run on the processors. This supports our claim of good locality. Because superblocks consist of a contiguous set of blocks, and a superblock stays associated with a processor until it becomes empty, our design minimizes false sharing.

Malloc and free operations tend to occur on the same processor. The only exceptions are when a thread migrates between the malloc and free, or when the allocating thread hands off the block to be freed by some other thread. This means that most free operations are on local blocks and take the "fast path" shown in free, above.

We believe the arguments in [3] regarding Hoard's maximum fragmentation and bounded blowup apply to LFMalloc as well.

4. Results

In this section we describe our results. We compare the scores of a set of malloc benchmarks, varying the number of threads used in the benchmarks, and the underlying malloc package. We also pick a single micro-benchmark, and, holding the number of threads constant, perform a detailed analysis of the execution of that program with various malloc implementations.

4.1 Methodology and Environment

All tests were performed on a 16-processor Sun E6500 system with 400 MHz UltraSPARC-II processors. The system had 16 GB of RAM and was running Solaris 8. Each processor had a 16 KB L1 data cache, a 16 KB L1 instruction cache and a 4 MB direct-mapped L2 cache. All tests were run in Solaris' timeshare (TS) scheduling class. All components, except those provided in binary form by the system, were compiled in 32-bit mode with GCC 2.95.3.

We ran all tests with the liblwp libthread package (liblwp can be found in /lib/lwp in Solaris 8). Liblwp uses a single-level threading model where all user-level threads are bound 1:1 to LWPs (an LWP is a light-weight process, or kernel thread). In Solaris 9 this 1:1 threading model will become the default. Hoard version 2.1.0 hashes a thread's LWP ID to locate the appropriate local heap. Hoard queries a thread's current LWP ID by calling the undocumented lwp_self function. In the default libthread on Solaris 8 and lower, lwp_self simply fetches the current LWP ID from thread-local storage. In liblwp lwp_self calls the _lwp_self system call, which is significantly slower than accessing thread-local storage. We adapted Hoard to call thr_self instead of lwp_self, avoiding the system call. This is benign, as under liblwp or Solaris 9 libthread, a thread's ID is always the same as its LWP ID.

We built all C++ benchmarks with mallocwrap.cpp from Hoard. Mallocwrap replaces the __builtin new and delete operators with simple wrappers that call malloc and free, respectively. All malloc packages, including LFMalloc, were built as dynamically linked libraries.

We compared LFMalloc to Hoard 2.1.0, the default Solaris libc malloc, and mtmalloc. Mtmalloc, described as an "MT hot memory allocator", is delivered in binary form with Solaris. Libc malloc is highly serialized – all operations are protected by a single mutex.

A standard suite of multithreaded malloc benchmarks does not yet exist. Instead, we recapitulate some of the micro-benchmark results reported in previous papers. We use Larson, from [16], which can be found in the Hoard source release. Larson simulates a server. We discovered that Larson scores are sensitive to scheduling policy. Worker threads start and make progress outside the timing interval – before the main thread starts timing. If the main thread stalls, as can happen with a large number of worker threads, the Larson scores will be biased upward. As a remedy we added a barrier at the beginning of the worker's code path. The main thread starts timing and only then drops the barrier. All Larson scores reported in this paper reflect the modified version.

We also report results for linux-scalability [17] and Hoard's own threadtest. Finally, we introduce our own micro-benchmark, mmicro. In writing mmicro, we factored out influences such as scheduling decisions and fairness. The program measures the throughput of a set of homogeneous threads – it does not directly measure latency. Given a fixed time interval, mmicro measures the aggregate amount of work (malloc-free pairs) accomplished in the allotted time. No threads are created or destroyed in the timing interval.


The benchmark takes the number of threads, the timing interval, the object size and the capacity as arguments and reports the number of malloc-free pairs/second. During the timing interval, each thread loops, mallocing capacity objects in succession, and then freeing the objects in the order in which they were allocated.
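A sketch of what we take each mmicro worker to do is shown below; mmicro's source is not given in the paper, so the names, the timing flag, and the reporting convention are our assumptions.

  #include <stdlib.h>

  extern volatile int timing;                  /* cleared when the interval expires */

  long mmicro_worker(size_t objsize, int capacity)
  {
      void **objs = malloc(capacity * sizeof *objs);
      long pairs = 0;
      while (timing) {
          for (int i = 0; i < capacity; i++)   /* malloc capacity objects ...       */
              objs[i] = malloc(objsize);
          for (int i = 0; i < capacity; i++)   /* ... free them in allocation order */
              free(objs[i]);
          pairs += capacity;                   /* count completed malloc-free pairs */
      }
      free(objs);
      return pairs;                            /* aggregated into pairs/second      */
  }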
We used Solaris' cputrack utility to collect data from the processor's performance counters. Cputrack perturbs scheduling decisions. Mmicro scores are stable under cputrack. Larson, linux-scalability and threadtest, however, report different results under cputrack.

4.2 Speedup and Scalability

We ran Larson, threadtest, linux-scalability and mmicro on libc, mtmalloc, Hoard and LFMalloc, varying the number of concurrent threads from 1 through 16, and then 18, 20, 22, 24, 26, 28, 30, 32, 64, 128 and 256. The results are presented in Figure 2(a)-(d). All scores are normalized to libc – the Y-axis is given as the speedup multiple over libc. For all data points, LFMalloc was faster than libc, mtmalloc and Hoard.

The performance drop for Larson, Figure 2(a), at higher numbers of threads may be due to scalability obstacles outside the allocator. We modified Larson and removed the malloc-free operations from the loop of the worker thread. The modified program had a peak score of 153 times libc at 24 threads. At higher thread counts the performance dropped off, just as was seen in the unmodified version.

In Figure 2(e), we show the scalability characteristics of the various malloc packages by plotting the Y-axis as the speedup relative to one thread. An ideal benchmark running on a perfectly scalable allocator would have Speedup(t) = t, for t up to the number of processors. Both Hoard and LFMalloc show excellent scalability – concurrent allocator operations don't impede each other.

4.3 Execution Characteristics

To better understand the performance differences seen between the various allocators we picked a single micro-benchmark, mmicro, and a set of parameters and then ran on libc, mtmalloc, Hoard and LFMalloc, using cputrack and other tools to gather detailed data. We present that information in Table 2, Execution Characteristics. As can be seen in the table, programs running on LFMalloc have excellent cache and interconnect characteristics, supporting our previous claims. Notable is that mmicro on LFMalloc runs at .909 cycles per instruction, realizing the potential of the UltraSPARC's superscalar design.

Care should be taken when examining the libc results. Under libc, mmicro was only able to achieve 14.1% CPU utilization – it could only saturate about two of the 16 processors. Threads in the libc trial spent most of their time blocked on the single heap mutex.

The raw data translation lookaside buffer (DTLB) miss rate was higher under LFMalloc than Hoard, but in the timing interval LFMalloc executed nearly 10 times more malloc-free pairs. The TLB miss rate per allocation pair is lower for LFMalloc.

Calls to malloc and free from the benchmark programs pass through the PLT (Procedure Link Table). The PLT is fundamental to cross-module calls and dynamic linking. When the run-time linker resolves cross-module references it patches the PLT instead of the call sites. The call sites transfer control to the PLT and the PLT transfers control to the ultimate target. This improves text sharing, but at the cost of space (for the PLTs) and additional control transfers. We have measured a round-trip call through a PLT at 35 nsecs, so the cost of control transfer begins to play a large part in LFMalloc performance.

In Table 3, we explore which aspect of our algorithm – fast RCS-based synchronization or precise CPU-identity – contributes the most to performance. We vary the synchronization mechanism used to protect the local free lists and the mechanism used to select a local heap and then report the results of mmicro runs at 1, 16 and 256 threads. For reference, the first row represents an unmodified LFMalloc allocator. The 2nd row reports results for LFMalloc modified to use a per-CPU Solaris mutex to protect the local free lists. Threads still use the precise processor ID to locate the appropriate local heap. In the 3rd row we retain the per-CPU mutex but use a simple Hoard-style hash on the thread IDs to select a local heap. We are unable to combine MP-RCS-based synchronization with heap-selection-by-hashing as MP-RCS synchronization requires precise CPU identity to safely access CPU-specific data. As seen in the table, precise CPU-identity is critical to scalability. MP-RCS-based synchronization is helpful, but only improves performance about 2 times as compared to a traditional Solaris mutex.

The heap "footprint" for LFMalloc is generally very close to that of Hoard's. We ran threadtest at 1 through 16, 18, 20, 22, 24, 26, 28, 30, 32, 64, 128 and 256 threads. Hoard's peak heap utilization ranged from 6.591 MB to 7.430 MB while LFMalloc's peak heap utilization ranged from 4.980 MB to 5.898 MB.

In addition to scaling to multiple threads and CPUs, LFMalloc demonstrates excellent latency for single-threaded applications. Mmicro running with a single thread can execute 4,408,632 malloc-free pairs per CPU-second on LFMalloc, where the malloc and free operations can be satisfied in the local heap. Hoard, mtmalloc and libc can execute 414,166; 363,840 and 388,121 malloc-free pairs per CPU-second, respectively. For LFMalloc that rate holds even as we increase the number of threads and CPUs.

Table 4 provides a breakdown of the proportion of local, remote and offline free operations encountered while running Larson, linux-scalability, mmicro and threadtest at 16, 64 and 256 threads. Of particular interest is threadtest. Threadtest divides the number of objects, 300,000 in our case, by the number of threads, computing the local work-load W. Each worker thread then loops, allocating W objects and then freeing those objects. As we increase the number of threads, W decreases. When W is large, a larger proportion of free operations will be offline instead of local. This is reflected in Table 4: as the number of threads increases the proportion of local free operations increases and the proportion of offline free operations decreases. This can also be seen in Figure 2(b). Above 30 threads most free operations are local and take the fast path, resulting in better performance. The superblock size is a critical parameter - naively, larger superblocks stay online longer and the odds that a free operation will be local increase with the superblock size. Local free operations, as they are lock-free, are faster than offline free operations, which require acquiring and releasing a mutex.
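The threadtest structure described above can be pictured with the short sketch below. Threadtest ships with Hoard, but the code here is only our paraphrase, with assumed names, of the behavior the text relies on (the per-thread work load W shrinking as the thread count grows).

  #include <stdlib.h>

  #define TOTAL_OBJECTS 300000                /* divided among the worker threads */

  void threadtest_worker(int nthreads, int iterations, size_t objsize)
  {
      int W = TOTAL_OBJECTS / nthreads;       /* per-thread work load */
      void **objs = malloc(W * sizeof *objs);
      for (int it = 0; it < iterations; it++) {
          for (int i = 0; i < W; i++)         /* allocate W objects ...  */
              objs[i] = malloc(objsize);
          for (int i = 0; i < W; i++)         /* ... then free them all  */
              free(objs[i]);
      }
      free(objs);
  }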


[Figure 2. Scalability and Speedup. Five panels plot performance (Y-axis) against the number of threads (X-axis, 1 through 256) for libc, mtmalloc, hoard and LFMalloc: (a) larson: speedup relative to libc; (b) threadtest: speedup relative to libc; (c) linux-scalability: speedup relative to libc; (d) mmicro: speedup relative to libc; (e) mmicro: speedup relative to one thread. Shaded areas indicate regions where the number of threads is equal to or exceeds the number of CPUs.]


[Figure 3. Control Flow for Restartable Critical Sections. The figure shows a thread moving through the Solaris states BLOCKED, ONPROC and OFFPROC (other threads may run while it is off-processor), with the kernel driver posting an upcall to the user-mode critical execution manager at the ONPROC transition. The example restartable critical section pops a node from a CPU-specific list:

    1: LD Self->Cpu, C
    2: LD C->Head, H
    3: LD H->Next, N
    4: ST N, C->Head

The possible outcomes of the upcall are: (A) no preemption - simple return; (B) preemption - potential interference; (C) not in a critical section - simple return; (D) interrupted critical section - restart. The times t1-t4 marked in the figure correspond to the rows of Table 1.]

t1: The thread enters the restartable critical section. The critical section will attempt to pop a node off a CPU-specific linked list. Self is a pointer to thread-local storage and Self->Cpu is a pointer to CPU-specific data. The thread executes instructions #1 and #2 and is then interrupted – the interrupted user-mode instruction pointer (IP) points to instruction #3. Control passes into the kernel. The kernel deschedules the thread – the thread goes OFFPROC. Other related threads run, potentially rendering the values saved in H and C stale.

t2: The thread is eventually rescheduled – it goes ONPROC. The kernel calls a routine in the driver, previously registered as a context switch callback, to notify the driver that a thread of interest is making the transition from ready state to running state. The driver checks to see if (a) the thread has migrated to a different CPU since it last ran, or (b) any related threads have run on this CPU since the current thread last went OFFPROC. We'll presume (b) is the case, so there is a risk of interference. The driver posts an upcall – the driver saves the interrupted IP and substitutes the address of the notification routine. The thread eventually returns to user-mode, but restarts in the notification routine instead of at instruction #3.

t3: The notification routine updates Self->Cpu to refer to the appropriate CPU (the thread may have migrated). Next, it retrieves the interrupted IP, which refers to instruction #3, and checks to see if the IP was within a critical section. In our example the routine adjusts the IP to point to the beginning of the critical section (instruction #1). For convenience, we currently write critical sections in a non-destructive idiom – a critical section does not destroy any of its input register arguments. If a critical section is interrupted the critical execution manager can simply restart it. The notification routine also checks for restartable critical sections that have been interrupted by signals and arranges for them to restart as well.

t4: The upcall unwinds, passing control to the start of the interrupted critical section, causing it to restart.

Table 1. Example of an Interrupted Critical Section



Table 2. Execution Characteristics
(values are given in the order libc / mtmalloc / Hoard / LFMalloc)

# MP-RCS migration upcalls: – / – / – / 473
# MP-RCS preemption upcalls: – / – / – / 2705
# interrupted critical sections: – / – / – / 812
Peak heap size: unknown / unknown / 3.5 MB / 4.1 MB
Average malloc-free latency (nsecs per pair): 166,525 / 19,230 / 2,251 / 227
Overall malloc-free throughput (pairs/sec): 102,111 / 844,454 / 6,589,446 / 70,749,108
User-mode CPU utilization: 14.1% / 85.6% / 98.6% / 98.7%
Context switch rate (switches/second, including both voluntary switches caused by blocking and involuntary switches): 1043 / 8537 / 657 / 412
User-mode cycles per instruction (CPI): 1.985 / 3.106 / 1.505 / 0.909
User-mode snoops per second (snoop invalidations caused by cache line migration): 3,529,464 / 22,484,855 / 137,625 / 41,452
User-mode snoops per malloc-free pair: 33.382 / 26.112 / 0.021037 / 0.000583
System DTLB miss rate (exceptions/second): 4,965 / 208,114 / 4,870 / 10,028
User-mode L1 cache miss rate: 2.544% / 6.668% / 7.325% / 1.699%
User-mode L2 cache miss rate: 33.258% / 24.139% / 0.015% / 0.008%
User-mode L1 cache misses per malloc-free pair: 139.087 / 36.022 / 10.333 / 0.407

Execution characteristics of "mmicro 64 :10000 200" – a 10-second run with 64 threads; each thread loops, malloc()ing 64 blocks of length 200 and then free()ing the blocks in the same order in which they were allocated.

Table 3. Contributions of Synchronization and Precise Processor Identity

Synchronization mechanism / Local heap selection / 1 thread / 16 threads / 256 threads
Restartable critical sections / by processor ID via upcall / 4548 / 69911 / 72163
Per-CPU Solaris mutex / by processor ID via upcall / 2124 / 33020 / 30678
Per-CPU Solaris mutex / Hoard-style hash of thread ID / 2012 / 23742 / 2060

"mmicro T :10000 64" scores (malloc-free pairs/msec) for 1, 16 and 256 threads, showing the effect of different synchronization and heap selection mechanisms.


Table 4. Breakdown of free() Requests

Benchmark / Threads / %local / %remote / %offline
Larson / 16 / 37% / 17% / 49%
Larson / 64 / 11% / 16% / 73%
Larson / 256 / 6% / 9% / 85%
mmicro / 16 / 100% / 0% / 0%
mmicro / 64 / 100% / 0% / 0%
mmicro / 256 / 100% / 0% / 0%
Linux-scalability / 16 / 100% / 0% / 0%
Linux-scalability / 64 / 100% / 0% / 0%
Linux-scalability / 256 / 100% / 0% / 0%
threadtest / 16 / 13% / 3% / 84%
threadtest / 64 / 70% / 0% / 30%
threadtest / 256 / 93% / 0% / 7%

5. Conclusions

We have presented a malloc package, LFMalloc, that is scalable, has extremely low latency, and has an excellent memory footprint. Compared to other malloc implementations, like Hoard, we have demonstrated that LFMalloc is significantly better at scaling both with the number of processors and with the number of threads.

LFMalloc uses two key services: precise processor awareness, and scheduler-sensitive restartable critical sections. While restartable critical sections require precise processor awareness, the converse is not true. We have shown that of these, precise processor location awareness is more useful. However, the combination allows our allocator to manage shared per-CPU heaps in a largely lock-free manner, both reducing latency and improving scalability.

We have also introduced a facility, Multi-Processor Restartable Critical Sections (MP-RCS), which provides these two services: precise processor awareness and restartable critical sections. MP-RCS is novel in that it works on multiprocessors, allowing the efficient management of CPU-specific data. Other kernel-assisted synchronization techniques work only on uniprocessors. Finally, the use of the MP-RCS facility does not impose any limits on kernel preemption policies in the operating system, on the threading libraries, or on the applications.

5.1 Other Results

For uniprocessors and for a process where all related threads are bound to one CPU, restartable critical sections can emulate CAS; all data is CPU-specific. Our results show the emulated form to be 1.8 times faster than the actual hardware instruction.

We have extended MP-RCS to support multi-word transactions via commit records [14]. A thread logs its intended updates into a commit record and then publishes the log. If a commit operation is interrupted, other threads will help finish the pending updates before starting potentially conflicting transactions. We intend to incorporate this facility in subsequent versions of LFMalloc.

We have prototyped a simple free-block cache that "stacks" on top of an underlying malloc subsystem. Each processor has an array of local free lists, indexed by size-class. Malloc attempts to unlink a block from the local free list. If no block is available, malloc resorts to the underlying malloc subsystem. Likewise, free attempts to add a block to the local free list. If the local list has reached its maximum capacity, the free operator returns the block to the underlying subsystem. All blocks are originally allocated from the underlying malloc. This scheme does not use superblocks and the free operator does not distinguish between local, free and remote blocks. If a thread loops, allocating N blocks and freeing N blocks where N is greater than the local capacity, the stacked allocator degenerates to the speed of the underlying malloc subsystem. Despite that flaw, it often performs well. We layered a simple cache on top of the libc malloc subsystem and, at 256 threads, observed Larson scores of 132 times that of libc. Recall that the ideal speedup for Larson, with a zero-cost allocator, would be 153 times libc. Threadtest, however, degraded to only 1.15 times the speed of libc at 1 thread. In addition, blocks can migrate between CPUs more rapidly than in LFMalloc, and so the "stacked" allocator can induce passive false-sharing. Still, we believe there may be some merit to adding a cache layer on top of LFMalloc.
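The C sketch below illustrates the stacked scheme just described. It is a paraphrase under assumed names, not the prototype itself: the size-class helpers and CPU-identity function are placeholders, a small header is added so free can recover a block's class, and the synchronization (restartable critical sections or per-CPU locks) that the real prototype would need around the list operations is omitted.

  #include <stdlib.h>

  #define NCLASSES 32
  #define CAPACITY 64                          /* per-CPU, per-class cache limit */

  /* Each cached block carries a small header recording its size-class. */
  struct hdr { struct hdr *next; int sc; };

  struct cpu_cache {
      struct hdr *list[NCLASSES];
      int         count[NCLASSES];
  };

  static struct cpu_cache cache[64];           /* indexed by processor ID (assumed bound) */

  extern int    current_cpu(void);             /* assumed: precise CPU ID (e.g. via MP-RCS) */
  extern int    size_class(size_t sz);         /* assumed: quantizes a request size         */
  extern size_t class_size(int sc);            /* assumed: block size for a size-class      */

  void *cached_malloc(size_t sz)
  {
      int sc = size_class(sz);
      struct cpu_cache *c = &cache[current_cpu()];
      struct hdr *h = c->list[sc];
      if (h != 0) {                            /* hit: unlink from the local free list */
          c->list[sc] = h->next;
          c->count[sc]--;
      } else {                                 /* miss: resort to the underlying malloc */
          h = malloc(sizeof *h + class_size(sc));
          if (h == 0) return 0;
          h->sc = sc;
      }
      return h + 1;                            /* data area follows the header */
  }

  void cached_free(void *p)
  {
      struct hdr *h = (struct hdr *)p - 1;
      struct cpu_cache *c = &cache[current_cpu()];
      if (c->count[h->sc] < CAPACITY) {        /* cache the block locally */
          h->next = c->list[h->sc];
          c->list[h->sc] = h;
          c->count[h->sc]++;
      } else {
          free(h);                             /* return it to the underlying subsystem */
      }
  }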
5.2 Future Work

By creating superblocks of size B (B is a power of two), aligned on virtual addresses that are a multiple of B, we can eliminate the block header, saving 8 bytes per block and improving the overall footprint. Given a block, the free operator can locate the enclosing superblock by simply masking the address of the block.

To reduce the frequency of "false positive" upcalls, we intend to establish a convention so the kernel can determine if an interrupted thread was executing in a critical section or was in the middle of a transaction, and, optionally, determine if other threads passed through the critical section since the thread was last ONPROC. We are exploring other mechanisms, beyond instruction pointer range checks, to identify critical sections.

We have adapted a Java™ virtual machine to operate on per-CPU heap allocation buffers with MP-RCS operators. In addition to heap allocation, we believe that MP-RCS can offer fast logging write barriers, and the potential for fast mutator thread suspension, needed by stop-the-world garbage collectors. (A thread targeted for suspension would block itself in the upcall, as it comes ONPROC.)

The deferred free list – used by non-local free operations – is currently protected with a mutex. We have prototyped a design that makes operations on the deferred free list lock-free. It uses compare-and-swap, but employs a bounded-tag [21] device to avoid A-B-A problems.

Finally, we would like to collect data on NUMA systems – LFMalloc should perform well in that environment.

5.3 Acknowledgements

We would like to thank Chris Phillips for pointing out IBM's use of function recovery routines and service requests. We would also like to thank Paula J. Bishop, Ole Agesen and the anonymous reviewers for helpful comments.


6. References

[1] T. Anderson, B. Bershad, E. Lazowska and H. Levy. Scheduler Activations: Effective Kernel Support for User-Level Management of Parallelism. ACM Transactions on Computer Systems, 10(1). 1992.

[2] Alan Bawden. PCLSRing: Keeping Process State Modular. Available at ftp://ftp.ai.mit.edu/pub/alan/pclsr.memo. 1993.

[3] Emery Berger, Kathryn McKinley, Robert Blumofe and Paul Wilson. Hoard: A Scalable Memory Allocator for Multithreaded Applications. In ASPLOS-IX: Ninth International Conference on Architectural Support for Programming Languages and Operating Systems. 2000.

[4] Brian N. Bershad. Fast Mutual Exclusion for Uniprocessors. In ASPLOS-V: Fifth International Conference on Architectural Support for Programming Languages and Operating Systems. 1992.

[5] Brian N. Bershad. Practical Considerations for Non-Blocking Concurrent Objects. In Proc. International Conference on Distributed Computing Systems (ICDCS). May 1993.

[6] Hans-J. Boehm. Fast Multiprocessor Memory Allocation and Garbage Collection. HP Labs Technical Report HPL-2000-165. 2000.

[7] Jeff Bonwick and Jonathan Adams. Magazines and Vmem: Extending the Slab Allocator to Many CPUs and Arbitrary Resources. In Proc. USENIX Technical Conference. 2001.

[8] Ben Gamsa, Orran Krieger, Jonathan Appavoo and Michael Stumm. Tornado: Maximizing Locality and Concurrency in a Shared Memory Multiprocessor Operating System. In Proc. Symposium on Operating System Design and Implementation (OSDI-III). 1999.

[9] Wolfram Gloger. Dynamic Memory Allocator Implementations in Linux System Binaries. Available at www.dent.med.uni-muenchen.de/~wmglo/malloc-slides.html. Site visited January 2002.

[10] Michael Greenwald. Non-Blocking Synchronization and System Design. Ph.D. Thesis, Stanford University, 1999.

[11] Maurice Herlihy. A Method for Implementing Highly Concurrent Data Objects. ACM Transactions on Programming Languages and Systems (TOPLAS) 15(5), November 1993.

[12] Richard L. Hudson, J. Eliot B. Moss, Sreenivas Subramoney and Weldon Washburn. Cycles to Recycle: Garbage Collection on the IA-64. In Tony Hosking, editor, ISMM 2000, Proc. Second International Symposium on Memory Management, 36(1) of the ACM SIGPLAN Notices. 2000.

[13] IBM OS/390 MVS Programming: Resource Recovery. GC28-1739-03. 1998.

[14] Theodore Johnson and Krishna Harathi. Interruptible Critical Sections. Dept. of Computer Science, University of Florida. Technical Report TR94-007. 1994.

[15] L. I. Kontothanassis, R. W. Wisniewski and M. L. Scott. Scheduler-Conscious Synchronization. ACM Transactions on Computer Systems, February 1997.

[16] P. Larson and M. Krishnan. Memory Allocation for Long-Running Server Applications. In International Symposium on Memory Management (ISMM 98). 1998.

[17] Chuck Lever and David Boreham. Malloc() Performance in a Multithreaded Linux Environment. In USENIX Technical Conference, 2000.

[18] Jim Mauro and Richard McDougall. Solaris™ Internals: Core Kernel Architecture. Sun Microsystems Press / Prentice-Hall. 2001.

[19] Paul McKenney and John Slingwine. Read-Copy Update: Using Execution History to Solve Concurrency Problems. In 10th IASTED International Conference on Parallel and Distributed Computing Systems (PDCS'98). 1998.

[20] Paul McKenney, Jack Slingwine and Phil Krueger. Experience with a Parallel Memory Allocator. In Software – Practice & Experience, Vol. 31. 2001.

[21] Mark Moir. Practical Implementations of Non-Blocking Synchronization Primitives. In Proc. of the 16th ACM Symposium on the Principles of Distributed Computing (PODC). 1997.

[22] William Moran and Farnam Jahanian. Cheap Mutual Exclusion. In Proc. USENIX Technical Conference. 1992.

[23] David Mosberger, Peter Druschel and Larry L. Peterson. A Fast and General Software Solution to Mutual Exclusion on Uniprocessors. Technical Report 94-07, Department of Computer Science, University of Arizona. June 1994.

[24] David Mosberger, Peter Druschel and Larry L. Peterson. Implementing Atomic Sequences on Uniprocessors Using Rollforward. In Software – Practice & Experience, Vol. 26, No. 1. January 1996.

[25] E. Moss and W. H. Kohler. Concurrency Features for the trellis/owl Language. In European Conference on Object-Oriented Programming (ECOOP'87). 1987.

[26] Greg Nakhimovsky. Improving Scalability of Multithreaded Dynamic Memory Allocation. In Dr. Dobb's Journal, #326. July 2001.

[27] O. Shivers, J. Clark and R. McGrath. Atomic Heap Transactions and Fine-grain Interrupts. In Proc. International Conference on Functional Programming (ICFP). 1999.

[28] Christopher Small and Margo Seltzer. Scheduler Activations on BSD: Sharing Thread Management State Between Kernel and Application. Harvard Computer Systems Laboratory Technical Report TR-31-95. 1995.

[29] Hiroaki Takada and Ken Sakamura. Real-Time Synchronization Protocols with Abortable Critical Sections. In Proc. of the First Workshop on Real-Time Systems and Applications (RTCSA). 1994.

[30] John Valois. Lock-Free Data Structures. Ph.D. Thesis, Rensselaer Polytechnic Institute, 1995.

[31] Voon-Yee Vee and Wen-Jing Hsu. A Scalable and Efficient Storage Allocator on Shared-Memory Multiprocessors. In International Symposium on Parallel Architectures, Algorithms, and Networks (I-SPAN 99). 1999.

[32] Paul R. Wilson, Mark S. Johnstone, Michael Neeley and David Boles. Dynamic Storage Allocation: A Survey and Critical Review. In Proc. International Workshop on Memory Management, 1995.

Sun, Sun Microsystems, Java, JDK and Solaris are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries. All SPARC trademarks are used under license, and are trademarks or registered trademarks of SPARC International, Inc. in the United States and other countries. Products bearing the SPARC trademark are based on an architecture developed by Sun Microsystems, Inc.
