The Multikernel: A new OS architecture for scalable multicore systems

Andrew Baumann,* Paul Barham,† Pierre-Evariste Dagand,‡ Tim Harris,† Rebecca Isaacs,† Simon Peter,* Timothy Roscoe,* Adrian Schüpbach,* and Akhilesh Singhania*

*Systems Group, ETH Zurich   †Microsoft Research, Cambridge   ‡ENS Cachan Bretagne

Abstract

Commodity computer systems contain more and more processor cores and exhibit increasingly diverse architectural tradeoffs, including memory hierarchies, interconnects, instruction sets and variants, and IO configurations. Previous high-performance computing systems have scaled in specific cases, but the dynamic nature of modern client and server workloads, coupled with the impossibility of statically optimizing an OS for all workloads and hardware variants, poses serious challenges for operating system structures.

We argue that the challenge of future multicore hardware is best met by embracing the networked nature of the machine, rethinking OS architecture using ideas from distributed systems. We investigate a new OS structure, the multikernel, that treats the machine as a network of independent cores, assumes no inter-core sharing at the lowest level, and moves traditional OS functionality to a distributed system of processes that communicate via message-passing.

We have implemented a multikernel OS to show that the approach is promising, and we describe how traditional scalability problems for operating systems (such as memory management) can be effectively recast using messages and can exploit insights from distributed systems and networking. An evaluation of our prototype on multicore systems shows that, even on present-day machines, the performance of a multikernel is comparable with a conventional OS, and can scale better to support future hardware.

[Figure 1: The multikernel model. Applications run above per-core OS nodes, each holding a state replica; the nodes coordinate through agreement algorithms and asynchronous messages, above arch-specific code, on heterogeneous cores (x86, x64, ARM, GPU) joined by an interconnect.]

1 Introduction

Computer hardware is changing and diversifying faster than system software. A diverse mix of cores, caches, interconnect links, IO devices and accelerators, combined with increasing core counts, leads to substantial scalability and correctness challenges for OS designers.

Such hardware, while in some regards similar to earlier parallel systems, is new in the general-purpose computing domain. We increasingly find multicore systems in a variety of environments ranging from personal computing platforms to data centers, with workloads that are less predictable, and often more OS-intensive, than traditional high-performance computing applications. It is no longer acceptable (or useful) to tune a general-purpose OS design for a particular hardware model: the deployed hardware varies wildly, and optimizations become obsolete after a few years when new hardware arrives.

Moreover, these optimizations involve tradeoffs specific to hardware parameters such as the cache hierarchy, the memory consistency model, and relative costs of local and remote cache access, and so are not portable between different hardware types. Often, they are not even applicable to future generations of the same architecture. Typically, because of these difficulties, a scalability problem must affect a substantial group of users before it will receive developer attention.

We attribute these engineering difficulties to the basic structure of a shared-memory kernel with data structures protected by locks, and in this paper we argue for rethinking the structure of the OS as a distributed system of functional units communicating via explicit messages. We identify three design principles: (1) make all inter-core communication explicit, (2) make OS structure hardware-neutral, and (3) view state as replicated instead of shared.

The model we develop, called a multikernel (Figure 1), is not only a better match to the underlying hardware (which is networked, heterogeneous, and dynamic), but allows us to apply insights from distributed systems to the problems of scale, adaptivity, and diversity in operating systems for future hardware.

Even on present systems with efficient cache-coherent shared memory, building an OS using message-based rather than shared-data communication offers tangible benefits: instead of sequentially manipulating shared data structures, which is limited by the latency of remote data access, the ability to pipeline and batch messages encoding remote operations allows a single core to achieve greater throughput and reduces interconnect utilization. Furthermore, the concept naturally accommodates heterogeneous hardware.
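As a concrete, much-simplified illustration of that difference, the following C sketch contrasts the two styles. It is not Barrelfish code: the channel layout, the op_msg and core_channel names, and the single-producer/single-consumer assumption are purely illustrative. A core updating a lock-protected shared structure pays a remote-access latency per operation, whereas a core sending operation messages to the structure's owner can enqueue a whole batch before waiting for any reply.

    /*
     * Sketch: updating state through a lock-protected shared structure
     * versus batched messages to the owning core. Illustrative only --
     * the layout and names (op_msg, core_channel) are not Barrelfish's.
     */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    /* Shared-memory style: every core manipulates one lock-protected
     * table; each update stalls on remote cache-line transfers. */
    struct shared_table {
        pthread_mutex_t lock;
        uint64_t        entry[1024];
    };

    void shared_update(struct shared_table *t, int slot, uint64_t val)
    {
        pthread_mutex_lock(&t->lock);   /* lock and data lines migrate here */
        t->entry[slot] = val;
        pthread_mutex_unlock(&t->lock);
    }

    /* Multikernel style: the table is private to its owning core; other
     * cores enqueue operation messages on a per-core channel and may
     * pipeline many of them before waiting for any acknowledgement. */
    struct op_msg { int slot; uint64_t val; };

    #define RING 256                    /* power of two */
    struct core_channel {               /* single producer, single consumer */
        _Atomic unsigned head, tail;
        struct op_msg    ring[RING];
    };

    static bool chan_send(struct core_channel *c, struct op_msg m)
    {
        unsigned h = atomic_load_explicit(&c->head, memory_order_relaxed);
        unsigned t = atomic_load_explicit(&c->tail, memory_order_acquire);
        if (h - t == RING)              /* channel full */
            return false;
        c->ring[h % RING] = m;
        atomic_store_explicit(&c->head, h + 1, memory_order_release);
        return true;
    }

    /* Sender core: push a whole batch without a round trip per update. */
    void batched_update(struct core_channel *c, const struct op_msg *ops, int n)
    {
        for (int i = 0; i < n; i++)
            while (!chan_send(c, ops[i]))
                ;                       /* back-pressure: retry when full */
    }

    /* Owning core: drain the channel and apply the operations to memory
     * that only it touches, so those lines stay in its local cache. */
    void owner_poll(struct core_channel *c, uint64_t *local_entry)
    {
        unsigned t = atomic_load_explicit(&c->tail, memory_order_relaxed);
        unsigned h = atomic_load_explicit(&c->head, memory_order_acquire);
        while (t != h) {
            struct op_msg m = c->ring[t % RING];
            local_entry[m.slot] = m.val;
            t++;
        }
        atomic_store_explicit(&c->tail, t, memory_order_release);
    }

The point of the sketch is only the communication pattern: the sender's per-operation cost is a local enqueue, and how the owner batches and applies the actual updates is hidden behind the message interface rather than exposed as a shared, locked data structure.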
The contributions of this work are as follows:

• We introduce the multikernel model and the design principles of explicit communication, hardware-neutral structure, and state replication.

• We present a multikernel, Barrelfish, which explores the implications of applying the model to a concrete OS implementation.

• We show through measurement that Barrelfish satisfies our goals of scalability and adaptability to hardware characteristics, while providing competitive performance on contemporary hardware.

In the next section, we survey trends in hardware which further motivate our rethinking of OS structure. In Section 3, we introduce the multikernel model, describe its principles, and elaborate on our goals and success criteria. We describe Barrelfish, our multikernel implementation, in Section 4. Section 5 presents an evaluation of Barrelfish on current hardware based on how well it meets our criteria. We cover related and future work in Sections 6 and 7, and conclude in Section 8.

2 Motivations

Most computers built today have multicore processors, and future core counts will increase [12]. However, commercial multiprocessor servers already scale to hundreds of processors with a single OS image, and handle terabytes of RAM and multiple 10Gb network connections. Do we need new OS techniques for future multicore hardware, or do commodity operating systems simply need to exploit techniques already in use in larger multiprocessor systems?

In this section, we argue that the challenges facing a general-purpose operating system on future commodity hardware are different from those associated with traditional ccNUMA and SMP machines used for high-performance computing. In a previous paper [8] we argued that single computers increasingly resemble networked systems, and should be programmed as such. We rehearse that argument here, but also lay out additional scalability challenges for general-purpose system software.

2.1 Systems are increasingly diverse

A general-purpose OS today must perform well on an increasingly diverse range of system designs, each with different performance characteristics [60]. This means that, unlike large systems for high-performance computing, such an OS cannot be optimized at design or implementation time for any particular hardware configuration.

To take one specific example: Dice and Shavit show how a reader-writer lock can be built to exploit the shared, banked L2 cache on the Sun Niagara processor [40], using concurrent writes to the same cache line to track the presence of readers [20]. On Niagara this is highly efficient: the line remains in the L2 cache. On a traditional multiprocessor, it is highly inefficient: the line ping-pongs between the readers' caches.
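The shape of such a lock is roughly the following (a minimal sketch of the idea only, not Dice and Shavit's algorithm; the rw_indicator type, the per-reader byte slots, and the fixed MAX_READERS bound are our illustrative assumptions). Each reader announces itself by writing into its own slot of a single cache line, and a writer scans that line until all slots are clear; whether this is cheap or expensive depends entirely on whether the line can stay resident in a cache shared by all the threads.

    /* Sketch of a reader-indicator byte array packed into one cache line.
     * A simplified illustration of the idea, not Dice and Shavit's code. */
    #include <stdatomic.h>
    #include <stdbool.h>

    #define MAX_READERS 64              /* one byte slot per hardware thread */

    struct rw_indicator {
        /* All reader flags share one cache line on purpose: cheap if that
         * line lives in a cache shared by all threads (Niagara's banked L2),
         * costly if it must migrate between per-core caches. */
        _Alignas(64) _Atomic unsigned char reader[MAX_READERS];
        _Atomic bool writer;
    };

    void read_lock(struct rw_indicator *l, int self)
    {
        for (;;) {
            atomic_store(&l->reader[self], 1);      /* announce this reader  */
            if (!atomic_load(&l->writer))           /* no writer: proceed    */
                return;
            atomic_store(&l->reader[self], 0);      /* back off and wait     */
            while (atomic_load(&l->writer))
                ;
        }
    }

    void read_unlock(struct rw_indicator *l, int self)
    {
        atomic_store(&l->reader[self], 0);
    }

    void write_lock(struct rw_indicator *l)
    {
        bool expected = false;
        while (!atomic_compare_exchange_weak(&l->writer, &expected, true))
            expected = false;                        /* one writer at a time */
        for (int i = 0; i < MAX_READERS; i++)        /* wait for all reader  */
            while (atomic_load(&l->reader[i]))       /* slots to drain       */
                ;
    }

    void write_unlock(struct rw_indicator *l)
    {
        atomic_store(&l->writer, false);
    }

On Niagara the reader slots all hit the shared, banked L2, so read_lock costs little more than a store; on a cache-coherent multiprocessor with per-core caches, every such store invalidates the other readers' copies of the line, which is exactly the ping-ponging described above.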
This illustrates a general problem: any OS design tied to a particular synchronization scheme (and data layout policy) cannot exploit features of different hardware. Current OS designs optimize for the common hardware case; this becomes less and less efficient as hardware becomes more diverse. Worse, when hardware vendors introduce a design that offers a new opportunity for optimization, or creates a new bottleneck for existing designs, adapting the OS to the new environment can be prohibitively difficult.

Even so, operating systems are forced to adopt increasingly complex optimizations [27, 46, 51, 52, 57] in order to make efficient use of modern hardware. The recent scalability improvements to Windows 7 to remove the dispatcher lock touched 6000 lines of code in 58 files and have been described as "heroic" [58]. The Linux read-copy update implementation required numerous iterations due to feature interaction [50]. Backporting receive-side-scaling support to Windows Server 2003 caused serious problems with multiple other network subsystems including firewalls, connection-sharing and even Exchange servers.¹

¹ See Microsoft Knowledge Base articles 927168, 927695 and 948496.

[Figure 2: Node layout of an 8×4-core AMD system: each four-core node has per-core L1 and L2 caches and a shared L3, and nodes connect to local RAM, PCIe, and each other via HyperTransport links.]

2.2 Cores are increasingly diverse

Diversity is not merely a challenge across the range of commodity machines. Within a single machine, cores can vary, and the trend is toward a mix of different cores. Some will have the same instruction set architecture (ISA) but different performance characteristics [34, 59], since a processor with large cores will be inefficient for readily parallelized programs, but one using only small cores will perform poorly on the sequential parts of a program [31, 42]. Other cores have different ISAs for specialized functions [29], and many peripherals (GPUs, network interfaces, and other, often FPGA-based, specialized processors) are increasingly programmable.

Current OS designs draw a distinction between general-purpose cores, which are assumed to be homogeneous and run a single, shared instance of a kernel, […]

[…] machines and become substantially more important for performance than at present.

[Figure 3: Comparison of the cost of updating shared state using shared memory and message passing: latency (cycles × 1000) against number of cores (2–16) for SHM1, SHM2, SHM4, SHM8, MSG1, MSG8, and Server.]

2.4 Messages cost less than shared memory

In 1978 Lauer and Needham argued that message-passing and shared-memory operating systems are duals, and the choice of one model over another depends on the machine architecture on which the OS is built [43].