A Container-Based Approach to OS Specialization for Exascale Computing
Judicael A. Zounmevo∗, Swann Perarnau∗, Kamil Iskra∗, Kazutomo Yoshii∗, Roberto Gioiosa†, Brian C. Van Essen‡, Maya B. Gokhale‡, Edgar A. Leon‡
∗Argonne National Laboratory. {jzounmevo@, perarnau@mcs., iskra@mcs., [email protected]}
†Pacific Northwest National Laboratory. [email protected]
‡Lawrence Livermore National Laboratory. {vanessen1, maya, [email protected]}

Abstract—Future exascale systems will impose several conflicting challenges on the operating system (OS) running on the compute nodes of such machines. On the one hand, the targeted extreme scale requires the kind of high resource usage efficiency that is best provided by lightweight OSes. At the same time, substantial changes in hardware are expected for exascale systems. Compute nodes are expected to host a mix of general-purpose and special-purpose processors or accelerators tailored for serial, parallel, compute-intensive, or I/O-intensive workloads. Similarly, the deeper and more complex memory hierarchy will expose multiple coherence domains and NUMA nodes in addition to incorporating nonvolatile RAM. That expected workload and hardware heterogeneity and complexity is not compatible with the simplicity that characterizes high-performance lightweight kernels. In this work, we describe the Argo exascale node OS, our approach to providing in a single kernel the OS environments required by these two conflicting goals. We resort to multiple OS specializations on top of a single Linux kernel, coupled with multiple containers.

Keywords—OS Specialization; Lean; Container; Cgroups

I. INTRODUCTION

Disruptive new computing technology has already begun to change the scientific computing landscape. Hybrid CPUs, manycore systems, and low-power system-on-a-chip designs are being used in today's most powerful high-performance computing (HPC) systems. As these technology shifts continue and exascale machines close in, the Argo research project aims to provide an operating system and runtime (OS/R) designed to support extreme-scale scientific computations, one that efficiently leverages new chip and interconnect technologies while addressing the new modalities, programming environments, and workflows expected at exascale.

At the heart of the project are four key innovations: dynamic reconfiguration of node resources in response to workload changes, allowance for massive concurrency, a hierarchical framework for power and fault management, and a cross-layer communication protocol that allows resource managers and optimizers to communicate and control the platform. These innovations will result in an open-source prototype system that runs on several architectures and is expected to form the basis of production exascale systems deployed in the 2018–2020 timeframe.

The design is based on a hierarchical approach. A global view enables Argo to control resources such as power or interconnect bandwidth across the entire system, respond to system faults, or tune application performance. A local view is essential for scalability, enabling compute nodes to manage and optimize massive intranode thread and task parallelism and to adapt to new memory technologies. In addition, Argo introduces the idea of "enclaves": sets of resources dedicated to a particular service and capable of introspection and autonomic response. Enclaves will be able to change the system configuration of nodes and the allocation of power to different nodes, or to migrate data or computations from one node to another. They will be used to demonstrate the support of different levels of fault tolerance, a key concern of exascale systems, with some enclaves handling node failures by means of global restart and other enclaves supporting finer-level recovery.

We describe here the early stages of an ongoing effort, as part of Argo, to evolve the Linux kernel into an OS suitable for exascale nodes. A major step in designing any HPC OS involves reducing the interference between the OS and the HPC job. Consequently, we are exposing the hardware resources directly to the HPC runtime, which is a user-level software layer. However, the increase in resource complexity expected for next-generation supercomputers creates a need for some hardware management that is best left to the OS.

The resource complexity comes from the heterogeneous set of compute cores and accelerators, coupled with deeper and more complex memory hierarchies that feature multiple coherence and NUMA domains to cope with both the CPU heterogeneity and the massive intranode parallelism. Our design is based on the concept of OS specialization: splitting the node OS into several autonomous components, each managing its own subset of CPU cores, a memory region, and so on. We introduce the concepts of the ServiceOS, a fully-fledged Linux environment meant for node management and legacy application execution, and of Compute Containers, which realize lean OS environments meant for running HPC applications with little interference from the OS.
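The cpuset controller of Linux cgroups (hinted at by the keywords above) is the kind of stock mechanism such a split can build on. The following is a minimal sketch, not the Argo implementation: the cgroup mount point, the group names serviceos and compute0, and the core/NUMA split are all illustrative assumptions.

```c
/* Sketch: carve a lean "compute container" out of a node using the
 * cpuset cgroup controller (v1 interface), leaving core 0 and NUMA
 * node 0 to a fully-fledged service environment. Illustrative only. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>

static void write_str(const char *path, const char *val)
{
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); exit(EXIT_FAILURE); }
    fputs(val, f);
    fclose(f);
}

static void make_cpuset(const char *name, const char *cpus, const char *mems)
{
    char path[256];
    snprintf(path, sizeof path, "/sys/fs/cgroup/cpuset/%s", name);
    mkdir(path, 0755);
    snprintf(path, sizeof path, "/sys/fs/cgroup/cpuset/%s/cpuset.cpus", name);
    write_str(path, cpus);
    snprintf(path, sizeof path, "/sys/fs/cgroup/cpuset/%s/cpuset.mems", name);
    write_str(path, mems);
}

int main(void)
{
    /* ServiceOS: node management and legacy work on core 0, NUMA node 0. */
    make_cpuset("serviceos", "0", "0");
    /* Compute Container: cores 1-15 and NUMA node 1 for the HPC job. */
    make_cpuset("compute0", "1-15", "1");
    /* A job is then confined by writing its PID into compute0's tasks
     * file; all of its descendants inherit the confinement. */
    return 0;
}
```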
The rest of the paper is organized as follows. Section II provides motivation for putting forth OS specialization in the Argo NodeOS. Section III describes our approach to OS specialization using containers, as well as how Compute Containers can achieve various levels of leanness over the same fully-fledged Linux kernel used by the ServiceOS. Section IV discusses related work, and Section V concludes.

[Figure 1. Traditional HPC OS vs. lean OS with most resources exposed to the HPC runtime: (a) a traditional HPC OS with limited OS-bypass; (b) a lean HPC OS with substantial OS-bypass.]

[Figure 3. Resource heterogeneity and complexity in exascale nodes: heterogeneous sets of compute cores (big serial cores, massively parallel cores, accelerators, and other special-purpose cores), massive intra-node parallelism, and a deep and complex memory hierarchy with multiple coherence domains, NUMA nodes, and NVRAM.]

II. ESTABLISHING THE NEED FOR OS SPECIALIZATION

A. Lean OS

Recent 3.x Linux kernels have more than 15 million lines of code (LoC). With only about 60,000 LoC, the IBM Compute Node Kernel (CNK) [1] of Mira, a Blue Gene/Q supercomputer at Argonne National Laboratory, contrasts with Linux by being barely more than a thin layer between the hardware and the HPC job. The CNK LoC number includes the kernel, the management network stack, all the supported system calls, and limited file system facilities. CNK is tailored to achieving the minimum possible interference between the HPC job and its hardware usage; in doing so, it provides the maximum possible hardware efficiency to the HPC job. As shown for collective operations [2], low OS interference can make a noticeable difference in HPC job performance at extreme scales; in that respect, the lightweight nature of kernels such as CNK is a compelling characteristic for an exascale OS.

Contemporary HPC stacks already selectively bypass the OS for direct access to certain devices such as RDMA-enabled network cards (Fig. 1(a)). The Argo project goes further by exposing most hardware resources to the HPC runtime, which executes as part of the application (Fig. 1(b)). On top of fulfilling the minimal-interference goal, offloading the hardware resource management of the application from the OS to the HPC runtime is justified by the runtime being more knowledgeable about the needs of the HPC application than the OS. The OS as seen by the HPC application running in the Argo environment is said to be lean.
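A fixed-work quantum loop is a common way to observe such OS interference directly (a generic measurement technique, not one taken from this paper): every iteration performs identical work, so any spread in the measured durations is time taken away by the kernel or by daemons. The sample count and work size below are arbitrary choices.

```c
/* Sketch of a fixed-work quantum (FWQ) noise probe: time many identical
 * work quanta; variation across quanta reflects OS interference. */
#include <stdio.h>
#include <stdint.h>
#include <time.h>

#define SAMPLES 10000
#define WORK    100000  /* fixed amount of work per quantum */

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

int main(void)
{
    static uint64_t dur[SAMPLES];
    volatile uint64_t sink = 0;  /* keeps the work loop from being optimized out */

    for (int s = 0; s < SAMPLES; s++) {
        uint64_t t0 = now_ns();
        for (uint64_t i = 0; i < WORK; i++)
            sink += i;           /* the fixed work quantum */
        dur[s] = now_ns() - t0;
    }

    uint64_t min = dur[0], max = dur[0];
    for (int s = 1; s < SAMPLES; s++) {
        if (dur[s] < min) min = dur[s];
        if (dur[s] > max) max = dur[s];
    }
    /* On a noisy node the max can exceed the min by orders of magnitude;
     * a lean OS environment narrows this gap. */
    printf("quantum: min %llu ns, max %llu ns\n",
           (unsigned long long)min, (unsigned long long)max);
    return 0;
}
```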
B. Provisioning for Legacy Applications

A look at the list of the 500 most powerful supercomputers (Top500) [3], as of November 2014, shows that such systems overwhelmingly run Linux-family OSes; the applications written for them can therefore depend on kernel features that a lightweight kernel gives up. CNK, for instance, does not provide dynamic virtual memory to physical memory mapping. Similarly, it does not provide all the thread preemption scenarios of a vanilla Linux kernel. In particular, the absence of time-quantum-like sharing can prevent the well-known approach to nonblocking operations that is fulfilled by hidden agents backed by a middleware-level thread that runs in addition to the application threads [4].

While the overall Argo NodeOS seeks the leanness of a lightweight kernel, there is provision for supporting legacy applications that need a full Linux environment. An aspect of the OS specialization is to provide side by side in the same node:
• an OS environment for HPC applications that require a fully-fledged Linux kernel,
• an OS environment that is lean. As much as possible, that environment would allow the HPC runtime to get direct access to the hardware resources with little or no OS interference.

C. Provisioning for Heterogeneous Resources

Substantial changes in hardware are expected for exascale systems. The deeper and more complex memory hierarchy will expose multiple coherence domains and NUMA nodes in addition to incorporating NVRAM. The envisioned multiple coherence and NUMA domains are the consequence of the expected increase in CPU core density and processor heterogeneity at the node level. That resource complexity (Fig. 3) requires a hardware management rigor that is best left to a fully-fledged kernel such as Linux. Hardware drivers
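To make concrete the kind of NUMA-domain rigor a fully-fledged kernel enables, the hypothetical fragment below pins a thread and its working set to one NUMA domain through libnuma, the user-space interface to the Linux NUMA policy machinery (link with -lnuma). The node number and allocation size are assumptions for illustration, not choices made by the Argo NodeOS.

```c
/* Sketch: keep a thread and its working set on one NUMA node so it
 * never pays remote-access latency for its own data. */
#include <stdio.h>
#include <stdlib.h>
#include <numa.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this kernel\n");
        return EXIT_FAILURE;
    }

    /* Run the calling thread on NUMA node 1... */
    if (numa_run_on_node(1) != 0)
        perror("numa_run_on_node");

    /* ...and take its memory from node 1 only. */
    double *buf = numa_alloc_onnode(1 << 20, 1);
    if (!buf) {
        fprintf(stderr, "numa_alloc_onnode failed\n");
        return EXIT_FAILURE;
    }

    buf[0] = 42.0;  /* first touch: pages are faulted in on node 1 */
    printf("allocated 1 MiB on node 1, first element %.1f\n", buf[0]);
    numa_free(buf, 1 << 20);
    return 0;
}
```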