Corey: An Operating System for Many Cores

Silas Boyd-Wickizer∗ Haibo Chen† Rong Chen† Yandong Mao† Frans Kaashoek∗ Robert Morris∗ Aleksey Pesterev∗ Lex Stein‡ Ming Wu‡ Yuehua Dai◦ Yang Zhang∗ Zheng Zhang‡ ∗MIT †Fudan University ‡Microsoft Research Asia ◦Xi’an Jiaotong University

ABSTRACT

Multiprocessor application performance can be limited by the operating system when the application uses the operating system frequently and the operating system services use data structures shared and modified by multiple processing cores. If the application does not need the sharing, then the operating system will become an unnecessary bottleneck to the application's performance.

This paper argues that applications should control sharing: the kernel should arrange each data structure so that only a single processor need update it, unless directed otherwise by the application. Guided by this design principle, this paper proposes three operating system abstractions (address ranges, kernel cores, and shares) that allow applications to control inter-core sharing and to take advantage of the likely abundance of cores by dedicating cores to specific operating system functions.

Measurements of microbenchmarks on the Corey prototype operating system, which embodies the new abstractions, show how control over sharing can improve performance. Application benchmarks, using MapReduce and a Web server, show that the improvements can be significant for overall performance: MapReduce on Corey performs 25% faster than on Linux when using 16 cores. Hardware event counters confirm that these improvements are due to avoiding operations that are expensive on multicore machines.

1 INTRODUCTION

Cache-coherent shared-memory multiprocessor hardware has become the default in modern PCs as chip manufacturers have adopted multicore architectures. Chips with four cores are common, and trends suggest that chips with tens to hundreds of cores will appear within five years [2]. This paper explores new operating system abstractions that allow applications to avoid bottlenecks in the operating system as the number of cores increases.

Operating system services whose performance scales poorly with the number of cores can dominate application performance. Gough et al. show that contention for Linux's scheduling queues can contribute significantly to total application run time on two cores [12]. Veal and Foong show that as a Web server uses more cores directory lookups spend increasing amounts of time contending for spin locks [29]. Section 8.5.1 shows that contention for Linux address space data structures causes the percentage of total time spent in the reduce phase of a MapReduce application to increase from 5% at seven cores to almost 30% at 16 cores.

One source of poorly scaling operating system services is use of data structures modified by multiple cores. Figure 1 illustrates such a scalability problem with a simple microbenchmark. The benchmark creates a number of threads within a process, each thread creates a file descriptor, and then each thread repeatedly duplicates (with dup) its file descriptor and closes the result. The graph shows results on a machine with four quad-core AMD Opteron chips running Linux 2.6.25. Figure 1 shows that, as the number of cores increases, the total number of dup and close operations per unit time decreases. The cause is contention over shared data: the table describing the process's open files. With one core there are no cache misses, and the benchmark is fast; with two cores, the cache coherence protocol forces a few cache misses per iteration to exchange the lock and table data. More generally, only one thread at a time can update the shared file descriptor table (which prevents any increase in performance), and the increasing number of threads spinning for the lock gradually increases locking costs. This problem is not specific to Linux, but is due to POSIX semantics, which require that a new file descriptor be visible to all of a process's threads even if only one thread uses it.

Common approaches to increasing scalability include avoiding shared data structures altogether, or designing them to allow concurrent access via fine-grained locking or wait-free primitives. For example, the Linux community has made tremendous progress with these approaches [18].

A different approach exploits the fact that some instances of a given resource type need to be shared, while others do not. If the operating system were aware of an application's sharing requirements, it could choose resource implementations suited to those requirements.
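The file descriptor microbenchmark of Figure 1 is easy to reproduce on any POSIX system. The following is a minimal pthreads reconstruction written for this article; it is a sketch rather than the authors' benchmark code, and the iteration count and the use of /dev/null for the initial descriptor are arbitrary choices.

    /* dup/close microbenchmark sketch: each thread creates its own
     * descriptor, then repeatedly duplicates and closes it.  POSIX puts
     * every descriptor in the process-wide table, so the dup/close pair
     * contends on shared kernel state even though the descriptor is
     * logically private to the thread. */
    #include <fcntl.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define ITERS 1000000L

    static void *worker(void *arg)
    {
        (void)arg;
        int fd = open("/dev/null", O_RDONLY);
        if (fd < 0) {
            perror("open");
            return NULL;
        }
        for (long i = 0; i < ITERS; i++) {
            int d = dup(fd);
            if (d >= 0)
                close(d);
        }
        close(fd);
        return NULL;
    }

    int main(int argc, char **argv)
    {
        int nthreads = (argc > 1) ? atoi(argv[1]) : 1;
        pthread_t tid[64];
        if (nthreads > 64)
            nthreads = 64;
        for (int i = 0; i < nthreads; i++)
            pthread_create(&tid[i], NULL, worker, NULL);
        for (int i = 0; i < nthreads; i++)
            pthread_join(tid[i], NULL);
        return 0;
    }

Timing this program for different thread counts (pinned one per core) reproduces the downward curve of Figure 1: the work per iteration is constant, but every dup and close serializes on the process's file table.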

Figure 1: Throughput of the file descriptor dup and close microbenchmark on Linux.

A limiting factor in this approach is that the operating system interface often does not convey the needed information about sharing requirements. In the file descriptor example above, it would be helpful if the thread could indicate whether the new descriptor is private or usable by other threads. In the former case it could be created in a per-thread table without contention.

This paper is guided by a principle that generalizes the above observation: applications should control sharing of operating system data structures. "Application" is meant in a broad sense: the entity that has enough information to make the sharing decision, whether that be an operating system service, an application-level library, or a traditional user-level application. This principle has two implications. First, the operating system should arrange each of its data structures so that by default only one core needs to use it, to avoid forcing unwanted sharing. Second, the operating system interfaces should let callers control how the operating system shares data structures between cores. The intended result is that the operating system incurs sharing costs (e.g., cache misses due to memory coherence) only when the application designer has decided that is the best plan.

This paper introduces three new abstractions for applications to control sharing within the operating system. Address ranges allow applications to control which parts of the address space are private per core and which are shared; manipulating private regions involves no contention or inter-core TLB invalidations, while explicit indication of shared regions allows sharing of hardware page tables and consequent minimization of soft page faults (page faults that occur when a core references pages with no mappings in the hardware page table but which are present in physical memory). Kernel cores allow applications to dedicate cores to run specific kernel functions, avoiding contention over the data those functions use. Shares are lookup tables for kernel objects that allow applications to control which object identifiers are visible to other cores. These abstractions are implementable without inter-core sharing by default, but allow sharing among cores as directed by applications.

We have implemented these abstractions in a prototype operating system called Corey. Corey is organized like an exokernel [9] to ease user-space prototyping of experimental mechanisms. The Corey kernel provides the above three abstractions. Most higher-level services are implemented as library operating systems that use the abstractions to control sharing. Corey runs on machines with AMD Opteron and Intel Xeon processors.

Several benchmarks demonstrate the benefits of the new abstractions. For example, a MapReduce application scales better with address ranges on Corey than on Linux. A benchmark that creates and closes many short TCP connections shows that Corey can saturate the network device with five cores by dedicating a kernel core to manipulating the device driver state, while 11 cores are required when not using a dedicated kernel core. A synthetic Web benchmark shows that Web applications can also benefit from dedicating cores to data.

This paper should be viewed as making a case for controlling sharing rather than demonstrating any absolute conclusions. Corey is an incomplete prototype, and thus it may not be fair to compare it to full-featured operating systems. In addition, changes in the architecture of future multicore processors may change the attractiveness of the ideas presented here.

The rest of the paper is organized as follows. Using architecture-level microbenchmarks, Section 2 measures the cost of sharing on multicore processors. Section 3 describes the three proposed abstractions to control sharing. Section 4 presents the Corey kernel and Section 5 its operating system services. Section 6 summarizes the extensions to the default system services that implement MapReduce and Web server applications efficiently. Section 7 summarizes the implementation of Corey. Section 8 presents performance results. Section 9 reflects on our results so far and outlines directions for future work. Section 10 relates Corey to previous work. Section 11 summarizes our conclusions.

2 MULTICORE CHALLENGES

The main goal of Corey is to allow applications to scale well with the number of cores. This section details some hardware obstacles to achieving that goal.

Future multicore chips are likely to have large total amounts of on-chip cache, but split up among the many cores. Performance may be greatly affected by how well software exploits the caches and the interconnect between cores. For example, using data in a different core's cache is likely to be faster than fetching data from memory, and using data from a nearby core's cache may be faster than using data from a core that is far away on the interconnect.

Figure 2: The AMD 16-core system topology. Memory access latency is in cycles and listed before the backslash. Memory bandwidth is in bytes per cycle and listed after the backslash. The measurements reflect the latency and bandwidth achieved by a core issuing load instructions. The measurements for accessing the L1 or L2 caches of a different core on the same chip are the same. The measurements for accessing any cache on a different chip are the same. Each cache line is 64 bytes, L1 caches are 64 Kbytes 8-way set associative, L2 caches are 512 Kbytes 16-way set associative, and L3 caches are 2 Mbytes 32-way set associative.

These properties can already be observed in current multicore machines. Figure 2 summarizes the memory system of a 16-core machine from AMD. The machine has four quad-core Opteron processors connected by a square interconnect. The interconnect carries data between cores and memory, as well as cache coherence broadcasts to locate and invalidate cache lines, and point-to-point cache coherence transfers of individual cache lines. Each core has a private L1 and L2 cache. Four cores on the same chip share an L3 cache. Cores on one chip are connected by an internal crossbar switch and can access each others' L1 and L2 caches efficiently. Memory is partitioned in four banks, each connected to one of the four chips. Each core has a clock frequency of 2.0 GHz.

Figure 2 also shows the cost of loading a cache line from a local core, a nearby core, and a distant core. We measured these numbers using many of the techniques documented by Yotov et al. [31]. The techniques avoid interference from the operating system and hardware features such as cache line prefetching. Reading from the local L3 cache on an AMD chip is faster than reading from the cache of a different core on the same chip. Inter-chip reads are slower, particularly when they go through two interconnect hops.

Figure 3: Time required to acquire and release a lock on a 16-core AMD machine when varying number of cores contend for the lock. The two lines show Linux kernel spin locks and MCS locks (on Corey). A spin lock with one core takes about 11 nanoseconds; an MCS lock about 26 nanoseconds.

Figure 3 shows the performance scalability of locking, an important primitive often dominated by cache miss costs. The graph compares Linux's kernel spin locks and scalable MCS locks [21] (on Corey) on the AMD machine, varying the number of cores contending for the lock. For both the kernel spin lock and the MCS lock we measured the time required to acquire and release a single lock. When only one core uses a lock, both types of lock are fast because the lock is always in the core's cache; spin locks are faster than MCS locks because the former requires three instructions to acquire and release when not contended, while the latter requires 15. The time for each successful acquire increases for the spin lock with additional cores because each new core adds a few more cache coherence broadcasts per acquire, in order for the new core to observe the effects of other cores' acquires and releases. MCS locks avoid this cost by having each core spin on a separate location.

We measured the average times required by Linux 2.6.25 to flush the TLBs of all cores after a change to a shared page table. These "TLB shootdowns" are considerably slower on 16 cores than on 2 (18742 cycles versus 5092 cycles). TLB shootdown typically dominates the cost of removing a mapping from an address space that is shared among multiple cores.

On AMD 16-core systems, the speed difference between a fast memory access and a slow memory access is a factor of 100, and that factor is likely to grow with the number of cores. Our measurements for the Intel Xeon show a similar speed difference. For performance, kernels must mainly access data in the local core's cache. A contended or falsely shared cache line decreases performance because many accesses are to remote caches. Widely-shared and contended locks also decrease performance, even if the critical sections they protect are only a few instructions. These performance effects will be more damaging as the number of cores increases, since the cost of accessing far-away caches increases with the number of cores.
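For readers unfamiliar with MCS locks [21], the sketch below shows the core idea in portable C11 atomics: each waiter spins on a flag in its own queue node rather than on the shared lock word, so handing the lock to the next waiter touches only that waiter's cache line. This is an illustrative sketch, not the Corey or Linux implementation.

    /* MCS queue lock sketch.  Each core passes its own node (for example,
     * stack-allocated) to both acquire and release. */
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    struct mcs_node {
        _Atomic(struct mcs_node *) next;
        atomic_bool locked;
    };

    struct mcs_lock {
        _Atomic(struct mcs_node *) tail;
    };

    void mcs_acquire(struct mcs_lock *l, struct mcs_node *me)
    {
        atomic_store(&me->next, NULL);
        atomic_store(&me->locked, true);
        /* Swap ourselves in as the new tail of the waiter queue. */
        struct mcs_node *prev = atomic_exchange(&l->tail, me);
        if (prev == NULL)
            return;                        /* lock was free; we own it */
        atomic_store(&prev->next, me);     /* link behind previous waiter */
        /* Spin on our own node, not on the shared lock word. */
        while (atomic_load(&me->locked))
            ;
    }

    void mcs_release(struct mcs_lock *l, struct mcs_node *me)
    {
        struct mcs_node *succ = atomic_load(&me->next);
        if (succ == NULL) {
            /* No known successor: try to swing the tail back to empty. */
            struct mcs_node *expected = me;
            if (atomic_compare_exchange_strong(&l->tail, &expected, NULL))
                return;
            /* A successor is linking itself in; wait for the link. */
            while ((succ = atomic_load(&me->next)) == NULL)
                ;
        }
        atomic_store(&succ->locked, false);   /* hand off to successor */
    }

As the text above notes, the uncontended MCS path executes more instructions than a simple spin lock, but under contention each core spins on its own node, which is why the MCS curve in Figure 3 stays nearly flat as cores are added.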

Figure 4: Example address space configurations for MapReduce executing on two cores. Lines represent mappings. In this example a stack is one page and results are three pages. (a) A single address space. (b) Separate address spaces. (c) Two address spaces with shared result mappings.

3 DESIGN

Existing operating system abstractions are often difficult to implement without sharing kernel data among cores, regardless of whether the application needs the shared semantics. The resulting unnecessary sharing and resulting contention can limit application scalability. This section gives three examples of unnecessary sharing and for each example introduces a new abstraction that allows the application to decide if and how to share. The intent is that these abstractions will help applications to scale to large numbers of cores.

3.1 Address ranges

Parallel applications typically use memory in a mixture of two sharing patterns: memory that is used on just one core (private), and memory that multiple cores use (shared). Most operating systems give the application a choice between two overall memory configurations: a single address space shared by all cores or a separate address space per core. The term address space here refers to the description of how virtual addresses map to memory, typically defined by kernel data structures and instantiated lazily (in response to soft page faults) in hardware-defined page tables or TLBs. If an application chooses to harness concurrency using threads, the result is typically a single address space shared by all threads. If an application obtains concurrency by forking multiple processes, the result is typically a private address space per process; the processes can then map shared segments into all address spaces. The problem is that each of these two configurations works well for only one of the sharing patterns, placing applications with a mixture of patterns in a bind.

As an example, consider a MapReduce application [8]. During the map phase, each core reads part of the application's input and produces intermediate results; map on each core writes its intermediate results to a different area of memory. Each map instance adds pages to its address space as it generates intermediate results. During the reduce phase each core reads intermediate results produced by multiple map instances to produce the output.

For MapReduce, a single address-space (see Figure 4(a)) and separate per-core address-spaces (see Figure 4(b)) incur different costs. With a single address space, the map phase causes contention as all cores add mappings to the kernel's address space data structures. On the other hand, a single address space is efficient for reduce because once any core inserts a mapping into the underlying hardware page table, all cores can use the mapping without soft page faults. With separate address spaces, the map phase sees no contention while adding mappings to the per-core address spaces. However, the reduce phase will incur a soft page fault per core per page of accessed intermediate results. Neither memory configuration works well for the entire application.

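Concretely, the separate-address-space configuration corresponds to a familiar POSIX pattern, sketched below purely for illustration (the fragment is ours, not from the paper): forked workers keep private heaps and page tables, and anything that must be visible across workers goes into an explicitly MAP_SHARED segment mapped before the fork. A threaded version would instead share every mapping, private or not.

    /* Illustration of "separate address spaces plus a shared segment". */
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define NCORES        4
    #define RESULTS_BYTES (4UL << 20)   /* per-worker results area */

    int main(void)
    {
        /* Shared results segment, mapped before forking so that every
         * child inherits the same mapping. */
        size_t total = (size_t)NCORES * RESULTS_BYTES;
        char *results = mmap(NULL, total, PROT_READ | PROT_WRITE,
                             MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        if (results == MAP_FAILED)
            return 1;

        for (int i = 0; i < NCORES; i++) {
            if (fork() == 0) {
                /* "map phase": private allocations contend with nobody;
                 * only writes to the shared segment are visible to the
                 * other workers. */
                char *mine = results + (size_t)i * RESULTS_BYTES;
                for (size_t off = 0; off < RESULTS_BYTES; off += 4096)
                    mine[off] = 1;
                _exit(0);
            }
        }
        for (int i = 0; i < NCORES; i++)
            wait(NULL);
        /* "reduce phase": any process can read all results, but each one
         * takes its own soft page faults on first touch. */
        munmap(results, total);
        return 0;
    }

Address ranges, introduced next, let an application make this private-versus-shared choice per region of the address space rather than once for the whole program.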
We propose address ranges to give applications high performance for both private and shared memory (see Figure 4(c)). An address range is a kernel-provided abstraction that corresponds to a range of virtual-to-physical mappings. An application can allocate address ranges, insert mappings into them, and place an address range at a desired spot in the address space. If multiple cores' address spaces incorporate the same address range, then they will share the corresponding pieces of hardware page tables, and see mappings inserted by each others' soft page faults. A core can update the mappings in a non-shared address range without contention. Even when shared, the address range is the unit of locking; if only one core manipulates the mappings in a shared address range, there will be no contention. Finally, deletion of mappings from a non-shared address range does not require TLB shootdowns.

Address ranges allow applications to selectively share parts of address spaces, instead of being forced to make an all-or-nothing decision. For example, the MapReduce runtime can set up address spaces as shown in Figure 4(c). Each core has a private root address range that maps all private memory segments used for stacks and temporary objects and several shared address ranges mapped by all other cores. A core can manipulate mappings in its private address ranges without contention or TLB shootdowns. If each core uses a different shared address range to store the intermediate output of its map phase (as shown in Figure 4(c)), the map phases do not contend when adding mappings. During the reduce phase there are no soft page faults when accessing shared segments, since all cores share the corresponding parts of the hardware page tables.

3.2 Kernel cores

In most operating systems, when application code on a core invokes a system call, the same core executes the kernel code for the call. If the system call uses shared kernel data structures, it acquires locks and fetches relevant cache lines from the last core to use the data. The cache line fetches and lock acquisitions are costly if many cores use the same shared kernel data. If the shared data is large, it may end up replicated in many caches, potentially reducing total effective cache space and increasing DRAM references.

We propose a kernel core abstraction that allows applications to dedicate cores to kernel functions and data. A kernel core can manage hardware devices and execute system calls sent from other cores. For example, a Web service application may dedicate a core to interacting with the network device instead of having all cores manipulate and contend for driver and device data structures (e.g., transmit and receive DMA descriptors). Multiple application cores then communicate with the kernel core via shared-memory IPC; the application cores exchange packet buffers with the kernel core, and the kernel core manipulates the network hardware to transmit and receive the packets.

This plan reduces the number of cores available to the Web application, but it may improve overall performance by reducing contention for driver data structures and associated locks. Whether a net performance improvement would result is something that the operating system cannot easily predict. Corey provides the kernel core abstraction so that applications can make the decision.

3.3 Shares

Many kernel operations involve looking up identifiers in tables to yield a pointer to the relevant kernel data structure; file descriptors and process IDs are examples of such identifiers. Use of these tables can be costly when multiple cores contend for locks on the tables and on the table entries themselves.

For each kind of lookup, the operating system interface designer or implementer typically decides the scope of sharing of the corresponding identifiers and tables. For example, Unix file descriptors are shared among the threads of a process. Process identifiers, on the other hand, typically have global significance. Common implementations use per-process and global lookup tables, respectively. If a particular identifier used by an application needs only limited scope, but the operating system implementation uses a more global lookup scheme, the result may be needless contention over the lookup data structures.

We propose a share abstraction that allows applications to dynamically create lookup tables and determine how these tables are shared. Each of an application's cores starts with one share (its root share), which is private to that core. If two cores want to share a share, they create a share and add the share's ID to their private root share (or to a share reachable from their root share). A root share doesn't use a lock because it is private, but a shared share does. An application can decide for each new kernel object (including a new share) which share will hold the identifier.

Inside the kernel, a share maps application-visible identifiers to kernel data pointers. The shares reachable from a core's root share define which identifiers the core can use. Contention may arise when two cores manipulate the same share, but applications can avoid such contention by placing identifiers with limited sharing scope in shares that are only reachable on a subset of the cores.

For example, shares could be used to implement file-descriptor-like references to kernel objects. If only one thread uses a descriptor, it can place the descriptor in its core's private root share. If the descriptor is shared between two threads, these two threads can create a share that holds the descriptor. If the descriptor is shared among all threads of a process, the file descriptor can be put in a per-process share. The advantage of shares is that the application can limit sharing of lookup tables and avoid unnecessary contention if a kernel object is not shared. The downside is that an application must often keep track of the share in which it has placed each identifier.

4 COREY KERNEL

Corey provides a kernel interface organized around five types of low-level objects: shares, segments, address ranges, pcores, and devices. Library operating systems provide higher-level system services that are implemented on top of these five types. Applications implement domain specific optimizations using the low-level interface; for example, using pcores and devices they can implement kernel cores. Figure 5 provides an overview of the Corey low-level interface. The following sections describe the design of the five types of Corey kernel objects and how they allow applications to control sharing. Section 5 describes Corey's higher-level system services.

    name obj_get_name(obj): return the name of an object
    shareid share_alloc(shareid, name, memid): allocate a share object
    void share_addobj(shareid, obj): add a reference to a shared object to the specified share
    void share_delobj(obj): remove an object from a share, decrementing its reference count
    void self_drop(shareid): drop current core's reference to a share
    segid segment_alloc(shareid, name, memid): allocate physical memory and return a segment object for it
    segid segment_copy(shareid, seg, name, mode): copy a segment, optionally with copy-on-write or -read
    nbytes segment_get_nbytes(seg): get the size of a segment
    void segment_set_nbytes(seg, nbytes): set the size of a segment
    arid ar_alloc(shareid, name, memid): allocate an address range object
    void ar_set_seg(ar, voff, segid, soff, len): map addresses at voff in ar to a segment's physical pages
    void ar_set_ar(ar, voff, ar1, aoff, len): map addresses at voff in ar to address range ar1
    ar_mappings ar_get(ar): return the address mappings for a given address range
    pcoreid pcore_alloc(shareid, name, memid): allocate a physical core object
    pcore pcore_current(void): return a reference to the object for current pcore
    void pcore_run(pcore, context): run the specified user context
    void pcore_add_device(pcore, dev): specify device list to a kernel core
    void pcore_set_interval(pcore, hz): set the time slice period
    void pcore_halt(pcore): halt the pcore
    devid device_alloc(shareid, hwid, memid): allocate the specified device and return a device object
    dev_list device_list(void): return the list of devices
    dev_stat device_stat(dev): return information about the device
    void device_conf(dev, dev_conf): configure a device
    void device_buf(dev, seg, offset, buf_type): feed a segment to the device object
    locality_matrix locality_get(void): get hardware locality information

Figure 5: Corey system calls. shareid, segid, arid, pcoreid, and devid represent 64-bit object IDs. share, seg, ar, pcore, obj, and dev represent ⟨share ID, object ID⟩ pairs. hwid represents a unique ID for hardware devices and memid represents a unique ID for per-core free page lists.

4.1 Object metadata

The kernel maintains metadata describing each object. To reduce the cost of allocating memory for object metadata, each core keeps a local free page list. If the architecture is NUMA, a core's free page list holds pages from its local memory node. The system call interface allows the caller to indicate which core's free list a new object's memory should be taken from. Kernels on all cores can address all object metadata since each kernel maps all of physical memory into its address space.

The Corey kernel generally locks an object's metadata before each use. If the application has arranged things so that the object is only used on one core, the lock and use of the metadata will be fast (see Figure 3), assuming they have not been evicted from the core's cache. Corey uses spin lock and read-write lock implementations borrowed from Linux, its own MCS lock implementation, and a scalable read-write lock implementation inspired by the MCS lock to synchronize access to object metadata.

4.2 Object naming

An application allocates a Corey object by calling the corresponding alloc system call, which returns a unique 64-bit object ID. In order for a kernel to use an object, it must know the object ID (usually from a system call argument), and it must map the ID to the address of the object's metadata. Corey uses shares for this purpose. Applications specify which shares are available on which cores by passing a core's share set to pcore_run (see below).

When allocating an object, an application selects a share to hold the object ID. The application uses ⟨share ID, object ID⟩ pairs to specify objects to system calls. Applications can add a reference to an object in a share with share_addobj and remove an object from a share with share_delobj. The kernel counts references to each object, freeing the object's memory when the count is zero. By convention, applications maintain a per-core private share and one or more shared shares.

4.3 Memory management

The kernel represents physical memory using the segment abstraction. Applications use segment_alloc to allocate segments and segment_copy to copy a segment or mark the segment as copy-on-reference or copy-on-write. By default, only the core that allocated the segment can reference it; an application can arrange to share a segment between cores by adding it to a share, as described above.

An application uses address ranges to define its address space. Each running core has an associated root address range object containing a table of address mappings. By convention, most applications allocate a root address range for each core to hold core private mappings, such as thread stacks, and use one or more address ranges that are shared by all cores to hold shared segment mappings, such as dynamically allocated buffers. An application uses ar_set_seg to cause an address range to map addresses to the physical memory in a segment, and ar_set_ar to set up a tree of address range mappings.

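To make the interface concrete, the fragment below sketches how a core might allocate a results segment, map it through an address range, and publish that address range in a share so other cores can splice it into their own root address ranges. It is an illustration only: the typedefs, prototypes, constants, and error-free style are our simplifications, and the real calls take the ⟨share ID, object ID⟩ pairs described in Figure 5 rather than bare IDs.

    /* Sketch against the Figure 5 interface; prototypes paraphrased with
     * simplified argument types, not taken from the Corey sources. */
    typedef unsigned long shareid_t, segid_t, arid_t;

    extern segid_t segment_alloc(shareid_t share, const char *name, int memid);
    extern void    segment_set_nbytes(segid_t seg, unsigned long nbytes);
    extern arid_t  ar_alloc(shareid_t share, const char *name, int memid);
    extern void    ar_set_seg(arid_t ar, unsigned long voff, segid_t seg,
                              unsigned long soff, unsigned long len);
    extern void    ar_set_ar(arid_t ar, unsigned long voff, arid_t ar1,
                             unsigned long aoff, unsigned long len);
    extern void    share_addobj(shareid_t share, arid_t obj);

    #define RESULTS_BYTES (64UL << 20)      /* size of this core's results  */
    #define RESULTS_VA    0x200000000UL     /* virtual offset for the range */

    void publish_results(shareid_t root_share, shareid_t shared_share,
                         arid_t my_root_ar, int local_memid)
    {
        /* Back the results with a segment allocated from this core's free
         * page list; naming it in the private root share keeps it invisible
         * to other cores for now. */
        segid_t results = segment_alloc(root_share, "results", local_memid);
        segment_set_nbytes(results, RESULTS_BYTES);

        /* Wrap the segment in an address range and splice that range into
         * this core's root address range. */
        arid_t results_ar = ar_alloc(root_share, "results-ar", local_memid);
        ar_set_seg(results_ar, RESULTS_VA, results, 0, RESULTS_BYTES);
        ar_set_ar(my_root_ar, RESULTS_VA, results_ar, 0, RESULTS_BYTES);

        /* Publish the address range in a share other cores can reach; a core
         * that maps results_ar into its own root address range then shares
         * the underlying hardware page tables (Section 3.1). */
        share_addobj(shared_share, results_ar);
    }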
4.4 Execution

Corey represents physical cores with pcore objects. Once allocated, an application can start execution on a physical core by invoking pcore_run and specifying a pcore object, instruction and stack pointer, a set of shares, and an address range. A pcore executes until pcore_halt is called. This interface allows Corey to space-multiplex applications over cores, dedicating a set of cores to a given application for a long period of time, and letting each application manage its own cores.

An application configures a kernel core by allocating a pcore object, specifying a list of devices with pcore_add_device, and invoking pcore_run with the kernel core option set in the context argument. A kernel core continuously polls the specified devices by invoking a device specific function. A kernel core polls both real devices and special "syscall" pseudo-devices. A syscall device allows an application to invoke system calls on a kernel core. The application communicates with the syscall device via a ring buffer in a shared memory segment.

5 SYSTEM SERVICES

This section describes three system services exported by Corey: execution forking, network access, and a buffer cache. These services, together with a C standard library that includes support for file descriptors, dynamic memory allocation, threading, and other common Unix-like features, create a scalable and convenient application environment that is used by the applications discussed in Section 6.

5.1 cfork

cfork is similar to Unix fork and is the main abstraction used by applications to extend execution to a new core. cfork takes a physical core ID, allocates a new pcore object and runs it. By default, cfork shares little state between the parent and child pcore. The caller marks most segments from its root address range as copy-on-write and maps them into the root address range of the new pcore. Callers can instruct cfork to share specified segments and address ranges with the new processor. Applications implement fine-grained sharing using shared segment mappings and more coarse-grained sharing using shared address range mappings. cfork callers can share kernel objects with the new pcore by passing a set of shares for the new pcore.

5.2 Network

Applications can choose to run several network stacks (possibly one for each core) or a single shared network stack. Corey uses the lwIP [19] networking library. Applications specify a network device for each lwIP stack. If multiple network stacks share a single physical device, Corey virtualizes the network card. Network stacks on different cores that share a physical network device also share the device driver data, such as the transmit descriptor table and receive descriptor table.

All configurations we have experimented with run a separate network stack for each core that requires network access. This design provides good scalability but requires multiple IP addresses per server and must balance requests using an external mechanism. A potential solution is to extend the Corey virtual network driver to use ARP negotiation to balance traffic between virtual network devices and network stacks (similar to the Linux Ethernet bonding driver).

5.3 Buffer cache

An inter-core shared buffer cache is important to system performance and often necessary for correctness when multiple cores access shared files. Since cores share the buffer cache they might contend on the data structures used to organize cached disk blocks. Furthermore, under write-heavy workloads it is possible that cores will contend for the cached disk blocks.

The Corey buffer cache resembles a traditional Unix buffer cache; however, we found three techniques that substantially improve multicore performance. The first is a lock-free tree that allows multiple cores to locate cached blocks without contention. The second is a write scheme that tries to minimize contention on shared data using per-core block allocators and by copying application data into blocks likely to be held in local hardware caches. The third uses a scalable read-write lock to ensure blocks are not freed or reused during reads.
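The buffer cache and the kernel's object metadata both rely on a scalable read-write lock so that readers on different cores do not bounce a shared cache line. The paper describes Corey's lock only as MCS-inspired; the sketch below shows a different but common construction of the same idea, per-core reader counts (a "big reader" lock), purely to illustrate why such locks scale for read-mostly workloads. The core count, padding, and all names are our own choices.

    /* Illustrative "big reader" lock: one reader counter per core, each on
     * its own cache line, plus a single writer flag.  Read lock/unlock touch
     * only the local core's counter; a writer sets the flag and then waits
     * for every per-core counter to drain.  All atomics use the default
     * sequentially consistent ordering for simplicity. */
    #include <stdatomic.h>
    #include <stdbool.h>

    #define NCORES 16

    struct br_percore {
        _Alignas(64) atomic_int readers;   /* one cache line per core */
    };

    struct brlock {
        struct br_percore percore[NCORES];
        atomic_bool writer;
    };

    void br_read_lock(struct brlock *l, int core)
    {
        for (;;) {
            atomic_fetch_add(&l->percore[core].readers, 1);
            if (!atomic_load(&l->writer))
                return;                    /* fast path: no writer active */
            /* A writer is active: back out and wait for it to finish. */
            atomic_fetch_sub(&l->percore[core].readers, 1);
            while (atomic_load(&l->writer))
                ;
        }
    }

    void br_read_unlock(struct brlock *l, int core)
    {
        atomic_fetch_sub(&l->percore[core].readers, 1);
    }

    void br_write_lock(struct brlock *l)
    {
        bool expected = false;
        while (!atomic_compare_exchange_weak(&l->writer, &expected, true))
            expected = false;              /* one writer at a time */
        for (int i = 0; i < NCORES; i++)   /* wait for readers to drain */
            while (atomic_load(&l->percore[i].readers) != 0)
                ;
    }

    void br_write_unlock(struct brlock *l)
    {
        atomic_store(&l->writer, false);
    }

The trade-off is the usual one for read-mostly data such as cached blocks: read-side cost stays local and constant as cores are added, while the write side pays a scan over all per-core counters.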

6 APPLICATIONS

It is unclear how big multicore systems will be used, but parallel computation and network servers are likely application areas. To evaluate Corey in these areas, we implemented a MapReduce system and a Web server with synthetic dynamic content. Both applications are data-parallel but have sharing requirements and thus are not trivially parallelized. This section describes how the two applications use the Corey interface to achieve high performance.

6.1 MapReduce applications

MapReduce is a framework for data-parallel execution of simple programmer-supplied functions. The programmer need not be an expert in parallel programming to achieve good parallel performance. Data-parallel applications fit well with the architecture of multicore machines, because each core has its own private cache and can efficiently process data in that cache. If the runtime does a good job of putting data in caches close to the cores that manipulate that data, performance should increase with the number of cores. The difficult part is the global communication between the map and reduce phases.

We started with the Phoenix MapReduce implementation [22], which is optimized for shared-memory multiprocessors. We reimplemented Phoenix to simplify its implementation, to use better algorithms for manipulating the intermediate data, and to optimize its performance. We call this reimplementation Metis.

Metis on Corey exploits address ranges as described in Section 3. Metis uses a separate address space on each core, with private mappings for most data (e.g. local variables and the input to map), so that each core can update its own page tables without contention. Metis uses address ranges to share the output of the map on each core with the reduce phase on other cores. This arrangement avoids contention as each map instance adds pages to hold its intermediate output, and ensures that the reduce phase incurs no soft page faults while processing intermediate data from the map phase.

6.2 Web server applications

The main processing in a Web server includes low-level network device handling, TCP processing, HTTP protocol parsing and formatting, application processing (for dynamically generated content), and access to application data. Much of this processing involves operating system services. Different parts of the processing require different parallelization strategies. For example, HTTP parsing is easy to parallelize by connection, while application processing is often best parallelized by partitioning application data across cores to avoid multiple cores contending for the same data. Even for read-only application data, such partitioning may help maximize the amount of distinct data cached by avoiding duplicating the same data in many cores' local caches.

Figure 6: A Corey Web server configuration with two kernel cores, two webd cores and three application cores. Rectangles represent segments, rounded rectangles represent pcores, and circles represent devices.

The Corey Web server is built from three components: Web daemons (webd), kernel cores, and applications (see Figure 6). The components communicate via shared-memory IPC. Webd is responsible for processing HTTP requests. Every core running a webd front-end uses a private TCP/IP stack. Webd can manipulate the network device directly or use a kernel core to do so. If using a kernel core, the TCP/IP stack of each core passes packets to transmit and buffers to receive incoming packets to the kernel core using a syscall device. Webd parses HTTP requests and hands them off to a core running application code. The application core performs the required computation and returns the results to webd. Webd packages the results in an HTTP response and transmits it, possibly using a kernel core. Applications may run on dedicated cores (as shown in Figure 6) or run on the same core as a webd front-end.

All kernel objects necessary for a webd core to complete a request, such as packet buffer segments, network devices, and shared segments used for IPC are referenced by a private per-webd core share. Furthermore, most kernel objects, with the exception of the IPC segments, are used by only one core. With this design, once IPC segments have been mapped into the webd and application address ranges, cores process requests without contending over any global kernel data or locks, except as needed by the application.

The application running with webd can run in two modes: random mode and locality mode. In random mode, webd forwards each request to a randomly chosen application core. In locality mode, webd forwards each request to a core chosen from a hash of the name of the data that will be needed in the request. Locality mode increases the probability that the application core will have the needed data in its cache, and decreases redundant caching of the same data.

7 IMPLEMENTATION

Corey runs on AMD Opteron and Intel Xeon processors. Our implementation is simplified by using a 64-bit virtual address space, but nothing in the Corey design relies on a large virtual address space. The implementation of address ranges is geared towards architectures with hardware page tables. Address ranges could be ported to TLB-only architectures, such as MIPS, but would provide less performance benefit, because every core must fill its own TLB.

The Corey implementation is divided into two parts: the low-level code that implements Corey objects (described in Section 4) and the high-level Unix-like environment (described in Section 5). The low-level code (the kernel) executes in the supervisor protection domain. The high-level Unix-like services are a library that applications link with and execute in application protection domains.

The low-level kernel implementation, which includes architecture specific functions and device drivers, is 11,000 lines of C and 150 lines of assembly. The Unix-like environment, 11,000 lines of C/C++, provides the buffer cache, cfork, and TCP/IP stack interface as well as the Corey-specific glue for the uClibc [28] C standard library, lwIP, and the Streamflow [24] dynamic memory allocator. We fixed several bugs in Streamflow and added support for x86-64 processors.

8 EVALUATION

This section demonstrates the performance improvements that can be obtained by allowing applications to control sharing. We make these points using several microbenchmarks evaluating address ranges, kernel cores, and shares independently, as well as with the two applications described in Section 6.

8.1 Experimental setup

We ran all experiments on an AMD 16-core system (see Figure 2 in Section 2) with 64 Gbytes of memory. We counted the number of cache misses and computed the average latency of cache misses using the AMD hardware event counters. All Linux experiments use Debian Linux with kernel version 2.6.25 and pin one thread to each core. The kernel is patched with perfctr 2.6.35 to allow application access to hardware event counters. For Linux MapReduce experiments we used our version of Streamflow because it provided better performance than other allocators we tried, such as TCMalloc [11] and glibc 2.7 malloc. All network experiments were performed on a gigabit switched network, using the server's Intel Pro/1000 Ethernet device.

We also have run many of the experiments with Corey and Linux on a 16-core Intel machine and with Windows on 16-core AMD and Intel machines, and we draw conclusions similar to the ones reported in this section.

8.2 Address ranges

To evaluate the benefits of address ranges in Corey, we need to investigate two costs in multicore applications where some memory is private and some is shared. First, the contention costs of manipulating mappings for private memory. Second, the soft page-fault costs for memory that is used on multiple cores. We expect Corey to have low costs for both situations. We expect other systems (represented by Linux in these experiments) to have low cost for only one type of sharing, but not both, depending on whether the application uses a single address space shared by all cores or a separate address space per core.

The memclone [3] benchmark explores the costs of private memory. Memclone has each core allocate a 100 Mbyte array and modify each page of its array. The memory is demand-zero-fill: the kernel initially allocates no memory, and allocates pages only in response to page faults. The kernel allocates the memory from the DRAM system connected to the core's chip. The benchmark measures the average time to modify each page. Memclone allocates cores from chips in a round-robin fashion. For example, when using five cores, memclone allocates two cores from one chip and one core from each of the other three chips.

Figure 7: Address ranges microbenchmark results. (a) memclone. (b) mempass.

Figure 7(a) presents the results for three situations: Linux with a single address space shared by per-core threads, Linux with a separate address space (process) per core but with the 100 Mbyte arrays mmaped into each process, and Corey with separate address spaces but with the arrays mapped via shared address ranges.

Memclone scales well on Corey and on Linux with separate address spaces, but poorly on Linux with a single address space.

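A pthreads reconstruction of the memclone loop is shown below; it is our own sketch under stated assumptions (4 Kbyte pages, mmap-backed demand-zero memory, no core pinning), not the original benchmark source. The threaded variant corresponds to the "Linux single" line; running the same worker in forked processes instead corresponds to "Linux separate".

    /* memclone reconstruction: each thread maps a private, demand-zero
     * 100 Mbyte region and writes one byte per 4 Kbyte page, so every write
     * to a fresh page takes a soft page fault in the kernel. */
    #include <pthread.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    #define ARRAY_BYTES (100UL << 20)
    #define PAGE_BYTES  4096UL

    static void *memclone_worker(void *arg)
    {
        (void)arg;
        char *a = mmap(NULL, ARRAY_BYTES, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (a == MAP_FAILED)
            return NULL;
        for (unsigned long off = 0; off < ARRAY_BYTES; off += PAGE_BYTES)
            a[off] = 1;                  /* first touch: soft page fault */
        munmap(a, ARRAY_BYTES);
        return NULL;
    }

    int main(int argc, char **argv)
    {
        int n = (argc > 1) ? atoi(argv[1]) : 1;   /* number of cores */
        pthread_t t[64];
        if (n > 64)
            n = 64;
        for (int i = 0; i < n; i++)
            pthread_create(&t[i], NULL, memclone_worker, NULL);
        for (int i = 0; i < n; i++)
            pthread_join(t[i], NULL);
        return 0;
    }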
On a page fault both Corey and Linux verify that the faulting address is valid, allocate and clear a 4 Kbyte physical page, and insert the physical page into the hardware page table. Clearing a 4 Kbyte page incurs 64 L3 cache misses and copies 4 Kbytes from DRAM to the local L1 cache and writes 4 Kbytes from the L3 back to DRAM; however, the AMD cache line prefetcher allows a core to clear a page in 3350 cycles (or at a rate of 1.2 bytes per cycle).

Corey and Linux with separate address spaces each incur only 64 cache misses per page fault, regardless of the number of cores. However, the cycles per page gradually increases because the per-chip memory controller can only clear 1.7 bytes per cycle and is saturated when memclone uses more than two cores on a chip. For Linux with a single address space cores contend over shared address space data structures. With six cores the cache-coherency messages generated from locking and manipulating the shared address space begin to congest the interconnect, which increases the cycles required for processing a page fault. Processing a page fault takes 14 times longer with 16 cores than with one.

We use a benchmark called mempass to evaluate the costs of memory that is used on multiple cores. Mempass allocates a single 100 Mbyte shared buffer on one of the cores, touches each page of the buffer, and then passes it to another core, which repeats the process until every core touches every page. Mempass measures the total time for every core to touch every page.

Figure 7(b) presents the results for the same configurations used in memclone. This time Linux with a single address space performs well, while Linux with separate address spaces performs poorly. Corey performs well here too. Separate address spaces are costly with this workload because each core separately takes a soft page fault for the shared page.

To summarize, a Corey application can use address ranges to get good performance for both shared and private memory. In contrast, a Linux application can get good performance for only one of these types of memory.

8.3 Kernel cores

This section explores an example in which use of the Corey kernel core abstraction improves scalability. The benchmark application is a simple TCP service, which accepts incoming connections, writing 128 bytes to each connection before closing it. Up to 15 separate client machines (as many as there are active server cores) generate the connection requests.

We compare two server configurations. One of them (called "Dedicated") uses a kernel core to handle all network device processing: placing packet buffer pointers in device DMA descriptors, polling for received packets and transmit completions, triggering device transmissions, and manipulating corresponding driver data structures. The second configuration, called "Polling", uses a kernel core only to poll for received packet notifications and transmit completions. In both cases, each other core runs a private TCP/IP stack and an instance of the TCP service. For Dedicated, each service core uses shared-memory IPC to send and receive packet buffers with the kernel core. For Polling, each service core transmits packets and registers receive packet buffers by directly manipulating the device DMA descriptors (with locking), and is notified of received packets via IPC from the Polling kernel core. The purpose of the comparison is to show the effect on performance of moving all device processing to the Dedicated kernel core, thereby eliminating contention over device driver data structures. Both configurations poll for received packets, since otherwise interrupt costs would dominate performance.

Figure 8: TCP microbenchmark results. (a) Throughput. (b) L3 cache misses.

Figure 8(a) presents the results of the TCP benchmark. The network device appears to be capable of handling only about 900,000 packets per second in total, which limits the throughput to about 110,000 connections per second in all configurations (each connection involves 4 input and 4 output packets). The Dedicated configuration reaches 110,000 with only five cores, while Polling requires 11 cores.

That is, each core is able to handle more connections per second in the Dedicated configuration than in Polling.

Most of this difference in performance is due to the Dedicated configuration incurring fewer L3 misses. Figure 8(b) shows the number of L3 misses per connection. Dedicated incurs about 50 L3 misses per connection, or slightly more than six per packet. These include misses to manipulate the IPC data structures (shared with the core that sent or received the packet) and the device DMA descriptors (shared with the device) and the misses for a core to read a received packet. Polling incurs about 25 L3 misses per packet; the difference is due to misses on driver data structures (such as indexes into the DMA descriptor rings) and the locks that protect them. To process a packet the device driver must acquire and release an MCS lock twice: once to place the packet on the DMA ring buffer and once to remove it. Since each L3 miss takes about 100 nanoseconds to service, misses account for slightly less than half the total cost for the Polling configuration to process a connection (with two cores). This analysis is consistent with Dedicated processing about twice as many connections with two cores, since Dedicated has many fewer misses. This approximate relationship holds up to the 110,000 connection per second limit.

The Linux results are presented for reference. In Linux every core shares the same network stack and a core polls the network device when the packet rate is high. All cores place output packets on a software transmit queue. Packets may be removed from the software queue by any core, but usually the polling core transfers the packets from the software queue to the hardware transmit buffer on the network device. On packet reception, the polling core removes the packet from the ring buffer and performs the processing necessary to place the packet on the queue of a TCP socket where it can be read by the application. With two cores, most cycles on both cores are used by the kernel; however, as the number of cores increases the polling core quickly becomes a bottleneck and other cores have many idle cycles.

The TCP microbenchmark results demonstrate that a Corey application can use kernel cores to improve scalability by dedicating cores to handling kernel functions and data.

8.4 Shares

We demonstrate the benefits of shares on Corey by comparing the results of two microbenchmarks. In the first microbenchmark each core calls share_addobj to add a per-core segment to a global share and then removes the segment from the share with share_delobj. The benchmark measures the throughput of add and remove operations. As the number of cores increases, cores contend on the scalable read-write lock protecting the hashtable used to implement the share. The second microbenchmark is similar to the first, except each core adds a per-core segment to a local share so that the cores do not contend.

Figure 9: Share microbenchmark results. (a) Throughput. (b) L3 cache misses.

Figure 9 presents the results. The global share scales relatively well for the first three cores, but at four cores performance begins to slowly decline. Cores contend on the global share, which increases the amount of cache coherence traffic and makes manipulating the global share more expensive. The local shares scale perfectly. At 16 cores adding to and removing from the global share costs 10 L3 cache misses, while no L3 cache misses result from the pair of operations in a local share.

We measured the L3 misses for the Linux file descriptor microbenchmark described in Section 1. On one core a duplicate and close incurs no L3 cache misses; however, L3 misses increase with the number of cores and at 16 cores a dup and close costs 51 L3 cache misses.

A Corey application can use shares to avoid bottlenecks that might be caused by contention on kernel data structures when creating and manipulating kernel objects. A Linux application, on the other hand, must use sharing policies specified by kernel implementers, which might force inter-core sharing and unnecessarily limit scalability.
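In code, the share microbenchmark's inner loop is just an add/remove pair per iteration. The fragment below is our own sketch, with prototypes paraphrased from Figure 5 and argument types simplified to plain IDs; how the per-core segment and share IDs are obtained is left out.

    /* Share microbenchmark sketch (Section 8.4).  In the global-share
     * variant every core passes the same share, so all cores contend on its
     * lookup table; in the per-core variant each core passes a share
     * reachable only from its own root share and the loop runs without
     * contention. */
    typedef unsigned long shareid_t, segid_t;

    extern void share_addobj(shareid_t share, segid_t obj);   /* Figure 5 */
    extern void share_delobj(segid_t obj);                    /* Figure 5 */

    #define ITERS 1000000UL

    void share_benchmark(shareid_t share, segid_t my_segment)
    {
        for (unsigned long i = 0; i < ITERS; i++) {
            share_addobj(share, my_segment);   /* insert into lookup table  */
            share_delobj(my_segment);          /* remove and drop reference */
        }
    }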

8.5 Applications

To demonstrate that address ranges, kernel cores, and shares have an impact on application performance we present benchmark results from the Corey port of Metis and from webd.

8.5.1 MapReduce

We compare Metis on Corey and on Linux, using the wri MapReduce application to build a reverse index of the words in a 1 Gbyte file. Metis allocates 2 Gbytes to hold intermediate values and, like memclone, Metis allocates cores from chips in a round-robin fashion. On Linux, cores share a single address space. On Corey each core maps the memory segments holding intermediate results using per-core shared address ranges. During the map phase each core writes ⟨word, index⟩ key-value pairs to its per-core intermediate results memory. For pairs with the same word, the indices are grouped as an array. More than 80% of the map phase is spent in strcmp, leaving little scope for improvement from address ranges. During the reduce phase each core copies all the indices for each word it is responsible for from the intermediate result memories of all cores to a freshly allocated buffer. The reduce phase spends most of its time copying memory and handling soft page faults. As the number of cores increases, Linux cores contend while manipulating address space data while Corey's address ranges keep contention costs low.

Figure 10: MapReduce wri results. (a) Corey and Linux performance. (b) Corey improvement over Linux.

Figure 10(a) presents the absolute performance of wri on Corey and Linux and Figure 10(b) shows Corey's improvement relative to Linux. For less than eight cores Linux is 1% faster than Corey because Linux's soft page fault handler is about 10% faster than Corey's when there is no contention. With eight cores on Linux, address space contention costs during reduce cause execution time for reduce to increase from 0.47 to 0.58 seconds, while Corey's reduce time decreases from 0.46 to 0.42 seconds. The time to complete the reduce phase continues to increase on Linux and decrease on Corey as more cores are used. At 16 cores the reduce phase takes 1.9 seconds on Linux and only 0.35 seconds on Corey.

8.5.2 Webd

As described in Section 6.2, applications might be able to increase performance by dedicating application data to cores. We explore this possibility using a webd application called filesum. Filesum expects a file name as argument, and returns the sum of the bytes in that file. Filesum can be configured in two modes. "Random" mode sums the file on a random core and "Locality" mode assigns each file to a particular core and sums a requested file on its assigned core. Figure 11 presents a four core configuration.

Figure 11: A webd configuration with two front-end cores and two filesum cores. Rectangles represent segments, rounded rectangles represent pcores, and the circle represents a network device.

We measured the throughput of webd with filesum in Random mode and in Locality mode. We configured webd front-end servers on eight cores and filesum on eight cores. This experiment did not use a kernel core or device polling. A single core handles network interrupts, and all cores running webd directly manipulate the network device. Clients only request files from a set of eight, so each filesum core is assigned one file, which it maps into memory. The HTTP front-end cores and filesum cores exchange file requests and responses using a shared memory segment.

54 8th USENIX Symposium on Operating Systems Design and Implementation USENIX Association 60000 Random NUMA machine. Like Disco, Corey aims to avoid ker- Locality nel bottlenecks with a small kernel that minimizes shared 50000 data. However, Corey has a kernel interface like an ex- 40000 okernel [9] rather than a virtual machine monitor. Corey’s focus on supporting library operating systems 30000 resembles Disco’s SPLASHOS, a runtime tailored for running the Splash benchmark [26]. Corey’s library op- 20000 erating systems require more operating system features, connections per second 10000 which has led to the Corey kernel providing a number of novel ideas to allow libraries to control sharing (address 0 128 256 512 1024 2048 4096 ranges, kernel cores, and shares). File size (Kbytes, log scale) K42 [3] and Tornado [10], like Corey, are designed so that independent applications do not contend over re- Figure 12: Throughput of Random mode and Locality mode. sources managed by the kernel. The K42 and Tornado kernels provide clustered objects that allow different im- it maps into memory. The HTTP front-end cores and plementations based on sharing patterns. The Corey ker- filesum cores exchange file requests and responses us- nel takes a different approach; applications customize ing a shared memory segment. sharing policies as needed instead of selecting from a set Figure 12 presents the results of Random mode and of policies provided by kernel implementers. In addition, Locality mode. For file sizes of 256 Kbytes and smaller Corey provides new abstractions not present in K42 and the performance of Locality and Random mode are both Tornado. limited by the performance of the webd front-ends’ net- Much work has been done on making widely-used op- work stacks. In Locality mode, each filesum core erating systems run well on multiprocessors [4]. The can store its assigned 512 Kbyte or 1024 Kbyte file Linux community has made many changes to improve in chip caches. For 512 Kbyte files Locality mode scalability (e.g., Read-Copy-Update (RCU) [20] locks, achieves 3.1 times greater throughput than Random and local runqueues [1], libnuma [16], improved load- for 1024 Kbyte files locality mode achieves 7.5 times balancing support [17]). These techniques are comple- greater throughput. Larger files cannot be fit in chip mentary to the new techniques that Corey introduces and caches and the performance of both Locality mode and have been a source of inspiration. For example, Corey Random mode is limited by DRAM performance. uses tricks inspired by RCU to implement efficient func- tions for retrieving and removing objects from a share. 9DISCUSSION AND FUTURE WORK As mentioned in the Introduction, Gough et al. [12] The results above should be viewed as a case for the and Veal and Foong [29] have identified Linux scaling principle that applications should control sharing rather challenges in the kernel implementations of scheduling than a conclusive “proof”. Corey lacks many features and directory lookup, respectively. of commodity operating systems, such as Linux, which Saha et al. have proposed the Multi-Core Run Time influences experimental results both positively and nega- (McRT) for desktop workloads [23] and have explored tively. dedicating some cores to McRT and the application and Many of the ideas in Corey could be applied to existing others to the operating system. Corey also uses spa- operating systems such as Linux. 
9 DISCUSSION AND FUTURE WORK

The results above should be viewed as a case for the principle that applications should control sharing rather than as a conclusive "proof". Corey lacks many features of commodity operating systems, such as Linux, which influences experimental results both positively and negatively.

Many of the ideas in Corey could be applied to existing operating systems such as Linux. The Linux kernel could use a variant of address ranges internally, and perhaps allow application control via mmap flags. Applications could use kernel-core-like mechanisms to reduce system call invocation costs, to avoid concurrent access to kernel data, or to manage an application's sharing of its own data using techniques like computation migration [6, 14]. Finally, it may be possible for Linux to provide share-like control over the sharing of file descriptors, entries in buffer and directory lookup caches, and network stacks.

Effective use of multicore processors has many aspects and potential solution approaches; Corey explores only a few. For example, applications could be assisted in coping with heterogeneous hardware [25]; threads could be automatically assigned to cores in a way that minimizes cache misses [27]; or applications could schedule work in order to minimize the number of times a given datum needs to be loaded from DRAM [7].

10 RELATED WORK

Corey is influenced by Disco [5] (and its relatives [13, 30]), which runs as a virtual machine monitor on a NUMA machine. Like Disco, Corey aims to avoid kernel bottlenecks with a small kernel that minimizes shared data. However, Corey has a kernel interface like an exokernel [9] rather than a virtual machine monitor. Corey's focus on supporting library operating systems resembles Disco's SPLASHOS, a runtime tailored for running the Splash benchmark [26]. Corey's library operating systems require more operating system features, which has led to the Corey kernel providing a number of novel ideas to allow libraries to control sharing (address ranges, kernel cores, and shares).

K42 [3] and Tornado [10], like Corey, are designed so that independent applications do not contend over resources managed by the kernel. The K42 and Tornado kernels provide clustered objects that allow different implementations based on sharing patterns. The Corey kernel takes a different approach: applications customize sharing policies as needed instead of selecting from a set of policies provided by kernel implementers. In addition, Corey provides new abstractions not present in K42 and Tornado.

Much work has been done on making widely-used operating systems run well on multiprocessors [4]. The Linux community has made many changes to improve scalability (e.g., Read-Copy-Update (RCU) [20] locks, local runqueues [1], libnuma [16], and improved load-balancing support [17]). These techniques are complementary to the new techniques that Corey introduces and have been a source of inspiration. For example, Corey uses tricks inspired by RCU to implement efficient functions for retrieving and removing objects from a share; a generic sketch of this read-mostly pattern appears at the end of this section. As mentioned in the Introduction, Gough et al. [12] and Veal and Foong [29] have identified Linux scaling challenges in the kernel implementations of scheduling and directory lookup, respectively.

Saha et al. have proposed the Multi-Core Run Time (McRT) for desktop workloads [23] and have explored dedicating some cores to McRT and the application and others to the operating system. Corey also uses spatial multiplexing, but provides a new kernel that runs on all cores and that allows applications to dedicate kernel functions to cores through kernel cores.

An Intel white paper reports use of spatial multiplexing of packet processing in a network analyzer [15], with performance considerations similar to webd running in Locality or Random modes.
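The RCU-inspired share operations mentioned above follow a familiar read-mostly pattern, shown schematically below. This is not Corey's implementation: it is a generic C11 illustration in which readers perform a single atomic load, a remover unpublishes the object, and reclamation is deferred until readers have drained (the grace-period step is only indicated by a comment).

    /* Generic read-mostly lookup/remove sketch, not Corey code. */
    #include <stdatomic.h>
    #include <stddef.h>

    #define SHARE_SLOTS 256

    struct obj { int id; /* ... */ };

    static _Atomic(struct obj *) share[SHARE_SLOTS];

    /* Lock-free read path: a single atomic load of the slot. */
    static struct obj *share_get(int slot)
    {
        return atomic_load_explicit(&share[slot], memory_order_acquire);
    }

    /* Remove path: unpublish the object, then wait for existing readers
     * to drain before reclaiming it (grace period or epoch scheme). */
    static struct obj *share_remove(int slot)
    {
        struct obj *old = atomic_exchange_explicit(&share[slot], NULL,
                                                   memory_order_acq_rel);
        /* wait_for_readers(); only then may the caller free(old) */
        return old;
    }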

11 CONCLUSIONS

This paper argues that, in order for applications to scale on multicore architectures, applications must control sharing. Corey is a new kernel that follows this principle. Its address range, kernel core, and share abstractions ensure that each kernel data structure is used by only one core by default, while giving applications the ability to specify when sharing of kernel data is necessary. Experiments with a MapReduce application and a synthetic Web application demonstrate that Corey's design allows these applications to avoid scalability bottlenecks in the operating system and outperform Linux on 16-core machines. We hope that the Corey ideas will help applications to scale to the larger number of cores on future processors. All Corey source code is publicly available.

ACKNOWLEDGMENTS

We thank Russ Cox, Austin Clements, Evan Jones, Emil Sit, Xi Wang, and our shepherd, Wolfgang Schröder-Preikschat, for their feedback. The NSF partially funded this work through award number 0834415.

REFERENCES

[1] J. Aas. Understanding the Linux 2.6.8.1 CPU scheduler, February 2005. http://josh.trancesoftware.com/linux/.

[2] A. Agarwal and M. Levy. Thousand-core chips: the kill rule for multicore. In Proceedings of the 44th Annual Conference on Design Automation, pages 750–753, 2007.

[3] J. Appavoo, D. D. Silva, O. Krieger, M. Auslander, M. Ostrowski, B. Rosenburg, A. Waterland, R. W. Wisniewski, J. Xenidis, M. Stumm, and L. Soares. Experience distributing objects in an SMMP OS. ACM Trans. Comput. Syst., 25(3):6, 2007.

[4] R. Bryant, J. Hawkes, J. Steiner, J. Barnes, and J. Higdon. Scaling Linux to the extreme. In Proceedings of the Linux Symposium 2004, pages 133–148, Ottawa, Ontario, June 2004.

[5] E. Bugnion, S. Devine, and M. Rosenblum. DISCO: running commodity operating systems on scalable multiprocessors. In Proceedings of the 16th ACM Symposium on Operating Systems Principles, pages 143–156, Saint-Malo, France, October 1997. ACM.

[6] M. C. Carlisle and A. Rogers. Software caching and computation migration in Olden. In Proceedings of the 5th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 1995.

[7] S. Chen, P. B. Gibbons, M. Kozuch, V. Liaskovitis, A. Ailamaki, G. E. Blelloch, B. Falsafi, L. Fix, N. Hardavellas, T. C. Mowry, and C. Wilkerson. Scheduling threads for constructive cache sharing on CMPs. In Proceedings of the 19th ACM Symposium on Parallel Algorithms and Architectures, pages 105–115. ACM, 2007.

[8] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Commun. ACM, 51(1):107–113, 2008.

[9] D. R. Engler, M. F. Kaashoek, and J. W. O'Toole. Exokernel: an operating system architecture for application-level resource management. In Proceedings of the 15th ACM Symposium on Operating Systems Principles, pages 251–266, Copper Mountain, CO, December 1995. ACM.

[10] B. Gamsa, O. Krieger, J. Appavoo, and M. Stumm. Tornado: maximizing locality and concurrency in a shared memory multiprocessor operating system. In Proceedings of the 3rd Symposium on Operating Systems Design and Implementation, pages 87–100, 1999.

[11] Google Performance Tools. http://goog-perftools.sourceforge.net/.

[12] C. Gough, S. Siddha, and K. Chen. Kernel scalability—expanding the horizon beyond fine grain locks. In Proceedings of the Linux Symposium 2007, pages 153–165, Ottawa, Ontario, June 2007.

[13] K. Govil, D. Teodosiu, Y. Huang, and M. Rosenblum. Cellular Disco: resource management using virtual clusters on shared-memory multiprocessors. In Proceedings of the 17th ACM Symposium on Operating Systems Principles, pages 154–169, Kiawah Island, SC, October 1999. ACM.

[14] W. C. Hsieh, M. F. Kaashoek, and W. E. Weihl. Dynamic computation migration in DSM systems. In Supercomputing '96: Proceedings of the 1996 ACM/IEEE Conference on Supercomputing (CDROM), Washington, DC, USA, 1996.

[15] Intel. Supra-linear packet processing performance with Intel multi-core processors. ftp://download.intel.com/technology/advanced_comm/31156601.pdf.

[16] A. Kleen. A NUMA API for Linux, August 2004. http://www.firstfloor.org/~andi/numa.html.

[17] Linux kernel mailing list. http://kerneltrap.org/node/8059.

[18] Linux Symposium. http://www.linuxsymposium.org/.

[19] lwIP. http://savannah.nongnu.org/projects/lwip/.

[20] P. E. McKenney, D. Sarma, A. Arcangeli, A. Kleen, O. Krieger, and R. Russell. Read-copy update. In Proceedings of the Linux Symposium 2002, pages 338–367, Ottawa, Ontario, June 2002.

[21] J. M. Mellor-Crummey and M. L. Scott. Algorithms for scalable synchronization on shared-memory multiprocessors. ACM Trans. Comput. Syst., 9(1):21–65, 1991.

[22] C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, and C. Kozyrakis. Evaluating MapReduce for multi-core and multiprocessor systems. In Proceedings of the 13th International Symposium on High Performance Computer Architecture, pages 13–24. IEEE Computer Society, 2007.

[23] B. Saha, A.-R. Adl-Tabatabai, A. Ghuloum, M. Rajagopalan, R. L. Hudson, L. Petersen, V. Menon, B. Murphy, T. Shpeisman, E. Sprangle, A. Rohillah, D. Carmean, and J. Fang. Enabling scalability and performance in a large-scale CMP environment. In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems, pages 73–86, New York, NY, USA, 2007. ACM.

[24] S. Schneider, C. D. Antonopoulos, and D. S. Nikolopoulos. Scalable locality-conscious multithreaded memory allocation. In Proceedings of the 2006 ACM SIGPLAN International Symposium on Memory Management, pages 84–94, 2006.

[25] A. Schüpbach, S. Peter, A. Baumann, T. Roscoe, P. Barham, T. Harris, and R. Isaacs. Embracing diversity in the Barrelfish manycore operating system. In Proceedings of the Workshop on Managed Many-Core Systems (MMCS), Boston, MA, USA, June 2008.

[26] J. P. Singh, W.-D. Weber, and A. Gupta. SPLASH: Stanford Parallel Applications for Shared-Memory. Computer Architecture News, 20(1):5–44.

[27] D. Tam, R. Azimi, and M. Stumm. Thread clustering: sharing-aware scheduling on SMP-CMP-SMT multiprocessors. In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems, pages 47–58, New York, NY, USA, 2007. ACM.

[28] uClibc. http://www.uclibc.org/.

[29] B. Veal and A. Foong. Performance scalability of a multi-core web server. In Proceedings of the 3rd ACM/IEEE Symposium on Architecture for Networking and Communications Systems, pages 57–66, New York, NY, USA, 2007. ACM.

[30] B. Verghese, S. Devine, A. Gupta, and M. Rosenblum. Operating system support for improving data locality on CC-NUMA compute servers. In Proceedings of the 7th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 279–289, New York, NY, USA, 1996. ACM.

[31] K. Yotov, K. Pingali, and P. Stodghill. Automatic measurement of memory hierarchy parameters. SIGMETRICS Perform. Eval. Rev., 33(1):181–192, 2005.
