Seminar HPC Trends Winter Term 2017/2018 New Operating System Concepts for High Performance Computing

Fabian Dreer
Ludwig-Maximilians Universität München
[email protected]
January 2018

Abstract

When running large-scale applications on clusters, the noise generated by the operating system can greatly impact the overall performance. In order to minimize overhead, new concepts for HPC OSs are needed as a response to increasing complexity while still considering existing API compatibility.

In this paper we study the design concepts of heterogeneous kernels using the example of mOS and the approach of library operating systems by exploring the architecture of Exokernel. We summarize architectural decisions, present a similar project in each case, Interface for Heterogeneous Kernels and Unikernels respectively, and show benchmark results where possible.

Our investigations show that both concepts have a high potential, reduce system noise and outperform a traditional Linux in the tasks they are already able to do. However, the results are only backed by micro-benchmarks, as most projects lack the maturity for comprehensive evaluations at the time of this writing.

1 The Impact of System Noise

When using a traditional operating system kernel in high performance computing applications, the cache and interrupt system are put under heavy load by, e.g., system services performing housekeeping tasks; this interference is also referred to as noise. The performance of the application is notably reduced by this noise.

Even small delays from cache misses or interrupts can affect the overall performance of a large-scale application. So-called jitter even influences collective communication with regard to synchronization, which can either absorb or propagate the noise. Even though asynchronous communication has a much higher probability of absorbing the noise, it is not completely unaffected. Collective operations suffer the most from propagation of jitter, especially when implemented linearly. But it is hard to analyse noise and its propagation for collective operations, even for simple algorithms. Hoefler et al. also suggest that "at large-scale, faster networks are not able to improve the application speed significantly because noise propagation is becoming a bottleneck." [5]

Hoefler et al. [5] also show that synchronization of point-to-point and collective communication and OS noise are tightly entangled and cannot be discussed in isolation. At full scale, when noise becomes the limiting factor, it eliminates all advantages of a faster network in collective operations as well as in full applications. This finding is crucial for the design of large-scale systems because the noise bottleneck must be considered in system design.

Yet very specialized systems like BlueGene/L [8] help to avoid most sources of noise [5]. Ferreira et al. [3] show that the impact depends on parameters of the system, that the already widely used concept of dedicated system nodes alone is not sufficient, and that the placement of noisy nodes does matter.

This work gives an overview of recent developments and new concepts in the field of operating systems for high performance computing. The approaches described in the following sections are, together with the traditional Full-Weight-Kernel approach, the most common ones.

The rest of this paper is structured as follows. In the next section we introduce the concept of running more than one kernel on a compute or service node and explore the details of that approach using the example of mOS and the Interface for Heterogeneous Kernels. Section 3 investigates the idea of library operating systems by having a look at Exokernel, one of the first systems designed after that concept, as well as a newer approach called Unikernels. Section 4 investigates HermitCore, which is a combination of the aforementioned designs. A comparison in Section 5 is followed by the conclusion in Section 6.

2 Heterogeneous Kernels

The idea behind the heterogeneous kernel approach is to run multiple different kernels side-by-side. Each kernel has its spectrum of jobs to fulfill and its own dedicated resources. This makes it possible to have different operating environments on the partitioned hardware. Especially with a view to heterogeneous architectures with different kinds of memory and multiple memory controllers, like the recent Intel Xeon Phi architecture, or chips with different types of cores and coprocessors, specialized kernels might help to use the full potential available.

We will first have an in-depth look at mOS, as this example shows nicely which aspects have to be taken care of in order to run different kernels on the same node.

2.1 mOS

To get a light and specialized kernel there are two methods typically used: the first one is to take a generic Full-Weight-Kernel (FWK) and strip away as much as possible; the second one is to build a minimal kernel from scratch. Either of these two approaches alone does not yield a fully Linux-compatible kernel, which in turn won't be able to run generic Linux applications [4].

Thus the key design parameters of mOS are: full Linux compatibility, limited changes to Linux, and full Light-Weight-Kernel scalability and performance, where performance and scalability are prioritized.

To avoid the tedious maintenance of patches to the Linux kernel, an approach inspired by FUSE has been taken. Its goal is to provide internal APIs to coordinate resource management between Linux and Light-Weight-Kernels (LWK) while still allowing each kernel to handle its own resources independently.

"At any given time, a sharable resource is either private to Linux or the LWK, so that it can be managed directly by the current owner." [11] The resources managed by the LWK must meet the following requirements:

i) to benefit from caching and reduced TLB misses, memory must be in physically contiguous regions,
ii) no interrupts are to be generated except for those of the applications,
iii) full control over scheduling must be provided,
iv) memory regions are to be shared among LWK processes,
v) efficient access to hardware must be provided in userspace, which includes well-performing MPI and PGAS runtimes,
vi) flexibility in allocated memory must be provided across cores (e.g. let rank0 have more memory than the other ranks), and
vii) system calls are to be sent to the Linux core or operating system node.

mOS consists of six components which will be introduced in one paragraph each:

According to Wisniewski et al. [11], the Linux running on the node can be any standard HPC Linux, configured for minimal memory usage and without disk paging. This component acts like a service providing Linux functionality, such as a TCP/IP stack, to the LWK. It takes over the bulk of the OS administration to keep the LWK streamlined; the most important aspects include: boot and configuration of the hardware, distribution of the resources to the LWK, and provision of a familiar administrative interface for the node (e.g. job monitoring).

The LWK which is running (possibly in multiple instantiations) alongside the compute node Linux. The job of the LWK is to provide as much hardware as possible to the running applications, as well as to manage its assigned resources. As a consequence the LWK takes care of memory management and scheduling [11].

A transport mechanism in order to let Linux and the LWK communicate with each other. This mechanism is explicit, labeled as function shipping, and comes in three different variations: via shared memory, messages or inter-processor interrupts. For shared memory to work without major modifications to Linux, the designers of mOS decided to separate the physical memory into Linux-managed and LWK-managed partitions, and to allow each kernel read access to the other's space. Messages and interrupts are inspired by a model [...] Linux [11].

The capability to direct system calls to the correct implementor (referred to as triage). The idea behind this separation is that performance-critical system calls are serviced by the LWK to avoid jitter; less critical calls, like signaling or /proc requests, are handled by the local Linux kernel; and all operations on the file system are offloaded to the operating system node (OSN). But this hierarchy of system call destinations does of course add complexity, not only to the triaging but also to the synchronization of the process context over the nodes [11].

An offloading mechanism to an OSN. Using a dedicated OSN to take care of I/O operations, in order to remove the jitter from the compute node, avoid cache pollution and make better use of memory, is already an older concept. Even though the design of mOS would suggest having file system operations handled on the local Linux, the offloading mechanism improves resource usage and client scaling [11].

The capability to partition resources is needed for running multiple kernels on the same node. Memory partitioning can be done either statically, by manipulating the memory maps at boot time and registering reserved regions, or dynamically, making use of hotplugging. The same possibilities are valid for the assignment of cores. Physical devices will in general be assigned to the Linux kernel in order to keep the LWK simple [11].

We have seen the description of the mOS architecture, which showed us many considerations for running multiple kernels side-by-side. As the design of mOS keeps compatibility with Linux core data structures, most applications should be supported. This project is still in an early development stage, therefore an exhaustive performance evaluation is not feasible at the moment.
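
The system call triage described for mOS can be illustrated with a small routing sketch. This is not the mOS implementation; the text only states the three destinations (LWK, local Linux, OSN) and the kind of call each one serves. The names `Target`, `PERFORMANCE_CRITICAL`, `FILE_SYSTEM` and `triage`, as well as the assignment of concrete system calls to each class, are assumptions made for illustration. In particular, classifying purely by call name ignores that a `read` on /proc would go to the local Linux rather than to the OSN.

```python
from enum import Enum


class Target(Enum):
    """The three system call destinations named in the text."""
    LWK = "light-weight kernel"
    LINUX = "local Linux kernel"
    OSN = "operating system node"


# Hypothetical classification: the names are ordinary Linux system calls,
# but which class each one belongs to is an assumption for this sketch.
PERFORMANCE_CRITICAL = {"mmap", "munmap", "brk", "sched_yield"}
FILE_SYSTEM = {"open", "read", "write", "close", "stat"}


def triage(syscall: str) -> Target:
    """Return the component that would service `syscall` in this sketch."""
    if syscall in PERFORMANCE_CRITICAL:
        return Target.LWK    # serviced locally by the LWK to avoid jitter
    if syscall in FILE_SYSTEM:
        return Target.OSN    # file system operations are offloaded
    return Target.LINUX      # everything else, e.g. signaling


if __name__ == "__main__":
    for name in ("mmap", "kill", "open"):
        print(f"{name:>6} -> {triage(name).value}")
```

Even in this toy form the added complexity mentioned above is visible: every call needs a classification step before it can be serviced, and two of the three destinations lie outside the kernel that received the call.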
