Extreme High Performance Computing or Why Microkernels Suck

Christoph Lameter, SGI
[email protected]

Abstract

One often wonders how well Linux scales. We frequently get suggestions that Linux cannot scale because it is a monolithic operating system kernel. However, microkernels have never scaled well, and Linux has been scaled up to support thousands of processors, terabytes of memory, and hundreds of petabytes of disk storage, which is the hardware limit these days. Among the techniques used to make Linux scale were per-cpu areas, per-node structures, lock splitting, cache line optimizations, memory allocation control, scheduler optimizations, and various other approaches. These required significant detail work on the code but no change in the general architecture of Linux.

The presentation will give an overview of why Linux scales and show the hurdles microkernels would have to overcome in order to do the same. It will assume a basic understanding of how operating systems work and familiarity with the functions a kernel performs.

1 Introduction

Monolithic kernels are in wide use today. One wonders, though, how far a monolithic kernel architecture can be scaled, given the complexity that has to be managed to keep the operating system working reliably. We ourselves were initially skeptical that we could go any further when we were first able to run our Linux kernel with 512 processors, because we encountered a series of scalability problems caused by the way the operating system handled the unusually high numbers of processes and processors. However, it was possible to address the scalability bottlenecks with some work, using a variety of synchronization methods provided by the operating system, and we were then surprised to encounter only minimal problems when we later doubled the number of processors to 1024. At that point the primary difficulties seemed to shift to other areas having more to do with the limitations of the hardware and firmware. We were then able to further double the processor count to two thousand and finally four thousand processors, and we were still encountering only minor problems that were easily addressed. We expect to be able to handle 16k processors in the near future.

As the number of processors grew, so did the amount of memory. In early 2007, machines are deployed with 8 terabytes of main memory. Such a system, with a huge amount of memory and a large set of processors, provides high performance capabilities in a traditional Unix environment, which allows traditional applications to run without major efforts to redesign their basic logic. Competing technologies, such as compute clusters, cannot offer such an environment. Clusters consist of many nodes that each run their own operating system, whereas a scaled-up monolithic operating system has a single address space and a single operating system instance. The challenge in clustered environments is to redesign the applications so that processing can be done concurrently on nodes that communicate only via a network. A large monolithic operating system with many processors and a lot of memory is easier to handle, since processes can share memory, which makes synchronization via Unix shared memory possible and data exchange simple.

Large scale operating systems are typically based on Non-Uniform Memory Architecture (NUMA) technology. Some memory is nearer to a processor than other memory, which may be more distant and more costly to access. Memory locality in such a system determines the overall performance of the applications. The operating system's role in this context is to provide heuristics in order to place memory in such a way that memory latencies are reduced as much as possible. These have to be heuristics because the operating system cannot know how an application will access allocated memory in the future. The access patterns of the application should ideally determine the placement of data, but the operating system has no way of predicting application behavior. A new memory control API was therefore added to the operating system so that applications can set up memory allocation policies that guide the operating system in allocating memory for the application. The notion of memory allocation control is not standardized, and so most contemporary programming languages have no ability to manage data locality on their own (there are some encouraging developments in this area, with Unified Parallel C supporting locality information in the language itself; see Tarek [2]). Libraries need to be provided that allow the application to give the operating system information about desirable memory allocation strategies.
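One concrete form of such an API on Linux is the memory policy interface, reachable from applications through the libnuma library (over the mbind and set_mempolicy system calls). The following minimal sketch shows two common policies; the buffer sizes and the choice of node 0 are illustrative assumptions, not details taken from the paper.

/* Minimal sketch of NUMA-aware allocation via libnuma (link with -lnuma).
   Sizes and node numbers are illustrative only. */
#include <numa.h>
#include <stdio.h>

int main(void)
{
        if (numa_available() < 0) {
                fprintf(stderr, "NUMA is not supported on this system\n");
                return 1;
        }

        /* Keep a thread's private working set on one node (node 0 here)
           so that its accesses stay local. */
        void *local = numa_alloc_onnode(1 << 20, 0);

        /* Interleave a large shared buffer across all nodes so that no
           single memory controller becomes a hot spot. */
        void *shared = numa_alloc_interleaved(64UL << 20);

        if (!local || !shared)
                return 1;

        /* ... the application would initialize and use the memory here ... */

        numa_free(local, 1 << 20);
        numa_free(shared, 64UL << 20);
        return 0;
}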
One idea that we keep encountering in discussions of these large scale systems is that microkernels should allow us to handle the scalability issues in a better way, and that they may actually allow a better designed system that is easier to scale. It was suggested that a microkernel design is essential to manage the complexity of the operating system and to ensure its reliable operation. We will evaluate that claim in the following sections.

2 Micro vs. Monolithic Kernel

Microkernels allow the use of the context control primitives of the processor to isolate the various components of the operating system. This allows a fine grained design of the operating system with natural APIs at the boundaries of the subsystems. However, separate address spaces require context switches at the boundaries, which may create significant overhead for the processors. Thus many microkernels are compromises between speed and the initially envisioned fine grained structure (a hybrid approach). To some extent that problem can be overcome by developing a very small low level kernel that fits into the processor cache (for example, L4) [11], but then we no longer have an easily programmable and maintainable operating system kernel. A monolithic kernel, by contrast, usually has a single address space, and all kernel components are able to access memory without restriction.

2.1 IPC vs. function call

Context switches have to be performed in order to isolate the components of a microkernel. Communication between different components must therefore be controlled through an Inter Process Communication (IPC) mechanism that incurs overhead similar to that of a system call in a monolithic kernel. Typically, microkernels use message queues to communicate between different components. In order for two components of a microkernel to communicate, the following steps have to be performed (both this sequence and the monolithic alternative below are sketched in code after the two lists):

1. The originating thread, in the context of the originating component, must format and place the request (or requests) in a message queue.

2. The originating thread must somehow notify the destination component that a message has arrived. Either interrupts (or some other form of signaling) are used, or the destination component must poll its message queue.

3. It may be necessary for the originating thread to perform a context switch if there are not enough processors around to continually run all threads (which is common).

4. The destination component must now access the message queue, interpret the message, and then perform the requested action. These four steps then potentially have to be repeated in order to return the result of the request to the originating component.

A monolithic operating system typically uses function calls to transfer control between subsystems that run in the same operating system context:

1. Place arguments in processor registers (done by the compiler).

2. Call the subroutine.

3. The subroutine accesses the registers to interpret the request (done by the compiler).

4. The subroutine returns the result in another register.
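To make the contrast concrete, here is a small user-space model of the message-queue round trip, using POSIX threads. All names are invented for the sketch; a real microkernel performs these steps with kernel-managed queues and address-space switches, which is exactly where the additional cost arises.

/* User-space model of the message-queue round trip (steps 1-4 above).
   Compile with -lpthread. Names are invented for illustration. */
#include <pthread.h>
#include <stdio.h>

struct msg { int request; int reply; int done; };

static struct msg slot;
static int pending;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t posted = PTHREAD_COND_INITIALIZER;
static pthread_cond_t completed = PTHREAD_COND_INITIALIZER;

static void *destination_component(void *arg)
{
        (void)arg;
        pthread_mutex_lock(&lock);
        while (!pending)                   /* step 2: wait for notification */
                pthread_cond_wait(&posted, &lock);
        slot.reply = slot.request * 2;     /* step 4: interpret message, act */
        slot.done = 1;
        pthread_cond_signal(&completed);   /* return path: steps 1-2 again */
        pthread_mutex_unlock(&lock);
        return NULL;
}

int main(void)
{
        pthread_t dest;
        pthread_create(&dest, NULL, destination_component, NULL);

        pthread_mutex_lock(&lock);
        slot.request = 21;                 /* step 1: format and enqueue */
        pending = 1;
        pthread_cond_signal(&posted);      /* step 2: notify destination */
        while (!slot.done)                 /* step 3: block; context switch */
                pthread_cond_wait(&completed, &lock);
        pthread_mutex_unlock(&lock);

        printf("reply: %d\n", slot.reply);
        pthread_join(dest, NULL);
        return 0;
}

The monolithic counterpart of this entire exchange is a single direct call such as reply = service_request(21): the compiler passes the argument and the result in registers, and no queue, notification, or context switch is involved.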
From the description above it is already evident that the monolithic operating system can rely on much lower level processor mechanisms than the microkernel and is well supported by the existing languages used to code operating system kernels. The microkernel has to manipulate message queues, which are higher level constructs and, unlike registers, cannot be directly modified and handled by the processor.

In a large NUMA system an even more troubling issue arises: while a function call uses barely any memory of its own (apart from the stack), a microkernel must place the message into queues. That queue must have a memory address, and the queue location needs to be carefully chosen in order to make the data accessible in a fast way to the other operating system component involved in the message transfer. The complexity of determining where to allocate the message queue will typically be higher than the message handling overhead itself, since such a determination involves consulting system tables to figure out memory latencies (a sketch of such a placement decision appears at the end of this section). If the memory latencies are handled by another component ...

However, the isolation will introduce additional complexities. Operating systems usually service applications running on top of them, and the operating system must track the state of the application. A failure of one key component typically also means the loss of relevant state information about the application. Some microkernel components that track memory use and open files may be so essential to the application that the application must terminate if either of these components fails. If one wanted to make a system fail-safe in a microkernel environment, then additional measures, such as checkpointing, may have to be taken in order to guarantee that the application can continue. However, the isolation of the operating system state into different modules will make it difficult to track the overall system state that needs to be preserved in order for checkpointing to work.
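Returning to the queue-placement problem raised above, the following is a sketch of what such a placement decision could look like, expressed with libnuma's view of the kernel's node distance table. pick_queue_node and alloc_message_queue are invented helpers, and the node numbers are illustrative; a microkernel would have to implement an equivalent lookup internally.

/* Sketch of a queue-placement decision on a NUMA system: choose the
   memory node that minimizes the combined access cost for the two
   components sharing the queue. Uses libnuma (link with -lnuma). */
#include <limits.h>
#include <numa.h>
#include <stddef.h>

static int pick_queue_node(int src_node, int dest_node)
{
        int best = src_node;
        int best_cost = INT_MAX;

        /* This table walk is exactly the kind of overhead described
           above: it can easily cost more than handling the message. */
        for (int node = 0; node <= numa_max_node(); node++) {
                int to_src = numa_distance(node, src_node);
                int to_dest = numa_distance(node, dest_node);

                /* numa_distance() returns 0 when the distance is unknown,
                   e.g., for a node number that is not populated. */
                if (to_src == 0 || to_dest == 0)
                        continue;

                if (to_src + to_dest < best_cost) {
                        best_cost = to_src + to_dest;
                        best = node;
                }
        }
        return best;
}

static void *alloc_message_queue(size_t size, int src_node, int dest_node)
{
        return numa_alloc_onnode(size, pick_queue_node(src_node, dest_node));
}

int main(void)
{
        if (numa_available() < 0)
                return 1;
        void *q = alloc_message_queue(4096, 0, numa_max_node());
        if (q)
                numa_free(q, 4096);
        return 0;
}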