Operating System for the K Computer

Operating System for the K computer Jun Moroo Masahiko Yamada Takeharu Kato For the K computer to achieve the world’s highest performance, Fujitsu has worked on the following three performance improvements in the development of the operating system (OS). First, to bring out the maximum hardware performance of our original CPU and interconnect, we provided a mechanism of controlling hardware extensions directly from applications. As the second improvement, we have introduced the synchronization scheduling function that minimizes the synchronization wait time of parallel programs resulting from system interruptions by coordinating job runtime and system runtime between multiple nodes. Third, multiple page size support that allows use of more than one page size has been achieved for improved memory access performance and memory utilization efficiency. This paper also describes the performance improvement functions, usability and robustness of the OS developed. 1. Introduction performance of the supercomputer is essential. Supercomputers are used in large-scale This paper gives a description of the OS of simulation computations in a variety of science the K computernote)i that is capable of conducting and technology fields. For weather forecasts, for large-scale simulations. example, the prediction area can be divided at the municipality level or even further at the station 2. Software configuration of the level to calculate values such as the temperature, K computer wind direction, and wind speed for each sub- Generally, supercomputers perform high- area, which enables highly accurate predictions speed computations by: to be made. Aerodynamic simulations on aircraft 1) Dividing huge computations allow people to estimate how an aircraft will 2) Operating many computers (compute nodes) move without having to build a model or actual concurrently in a coordinated manner aircraft. In the development of new drugs, 3) Transferring computation results at high substances that are candidates for a medicine speeds between compute nodes by using can be extracted out of a vast number of types a set of standard communication libraries and combinations to narrow them down to those called Message Passing Interface (MPI) with a curative effect. The OS processes application startup and IO Carrying out such large-scale simulations note)i “K computer” is the English name involves enormous amounts of memory and that RIKEN has been using for the compute nodes and a huge number of files. supercomputer of this project since July Accordingly, an operating system (OS) capable 2010. “K” comes from the Japanese word “Kei,” which means ten peta or 10 to the of bringing out the maximum hardware 16th power. FUJITSU Sci. Tech. J., Vol. 48, No. 3, pp. 295–301 (July 2012) 295 J. Moroo et al.: Operating System for the K computer processing requests and takes charge of system kernel and has additional drivers for using the processing including time of day control. For hardware of the K computer so that it can be the OS of the K computer, we have maximized used from the middleware in the upper level. the hardware and software performance by We have worked on developing the OS to improving the OS kernel and libraries. We install on the K computer, aiming for targets worked on developing the OS of the K computer including improved usability, improved with a target of achieving 10 PFLOPS, a performance and improved robustness as shown performance level ten times higher than the in Table 1. The following sections present the system at the time of its initial development. respective targets. The basic software of the K computer is composed of the OS and basic middleware 3. Improved usability (effective (operations management software, language utilization of user assets) system) (Figure 1). The OS implements the Some of the existing supercomputers use architecture-dependent portion of the Linux special OSes, on which software that has been Operations management software Language system Middleware Job management Compilers Debuggers tuning tools System control/management Language libraries MPI libraries OS Commands, libraries Process management Network File system OS Interrupt time management Memory management Device drivers Architecture-dependent portion Sector cache Barriers Tofu IO Tofu CPU (SPARC64 VIIIfx) Sector cache Barriers IO devices interconnect Hardware Figure 1 Software configuration of the K computer. Table 1 Features of the K computer OS. Feature Realization method Linux Improved usability Effective utilization of user assets POSIX API SIMD instruction support Hardware barrier Direct control of hardware extensions Sector cache Improved Extension of device drivers for performance hardware control registers Improved scalability Synchronization scheduling with rsadc Improved memory performance and efficiency Large page support Security User access protection Improved robustness RAS enhancement Fallback for failure 296 FUJITSU Sci. Tech. J., Vol. 48, No. 3 (July 2012) J. Moroo et al.: Operating System for the K computer developed by the user cannot be used. For the data in sector 1. This prevents frequently used OS of the K computer, we have adopted Linux to data from being expelled from the cache, and allow continued use of the user’s assets. thus improves the execution performance of Similarly to general OSes, the OS of the K applications. The amount of frequently used computer is equipped with process management, data may vary depending on the functions in memory management, IO management (device applications. Accordingly, the ratio of the sector driver), file system and network functions. cache used from user applications has been Linux is used in more than 91% of the TOP500 made changeable during program execution to ranked supercomputers and we have adopted allow the utilization efficiency of the cache to be it in view of making user applications portable. optimized. Users can use POSIX API, a UNIX application The performance improvement effect of programming interface (API), on the K computer. cache control cannot be sufficiently achieved In addition, the OS of the K computer has if overhead is generated due to the issuance API (library, command) extensions for using of a system call when a device driver is called extended hardware, which allows easy use of along with sector cache operation. For that hardware functions through the compiler and reason, device drivers for the K computer have MPI library functions. been designed and implemented to allow user These contrivances enable the OS of applications to directly control the registers that the K computer to port general open source control the sector cache. This has successfully software (application programs, tools, etc.) by reduced the sector cache control time from a few recompilation without having to change the microseconds to a few nanoseconds. source program. 4.2 Improved memory access 4. Improved performance performance 4.1 Direct control of hardware extensions The OS of the K computer is equipped SPARC64 VIIIfx is a SPARC64 VII-based with large page support as a function to provide CPU equipped with expanded registers, extended both high memory access performance and high SIMD instructions, hardware barriers between memory utilization efficiency. cores and sector cache for speedup. For the OS of Generally, the virtual memory management the K computer, we have prepared a mechanism unit provided for Linux for 64-bit SPARC with additional device drivers that allow these manages the memory used by programs in hardware extensions to be controlled directly 8 Kbyte page units. from applications. The drivers map registers A cache (TLB: translation lookaside that control barrier synchronization and sector buffer) is provided for speeding up the address cache into the memory space of applications, by translation performed by the hardware but the which high-speed control is achieved without number of entries is limited. With a page size involving system call overhead. of 8 Kbytes, use of memory of a few Mbytes or The sector cache is intended to be used to larger causes the process to exceed the capacity divide the cache in the CPU into two virtual of the TLB and overhead resulting from OS regions to facilitate caching of the data that are address translation is generated. repeatedly used by applications. If the page size to be managed with one Specifically, the complier generates memory entry is increased, high-speed access to memory access instructions to cache less frequently of a few Gbytes or larger can be achieved by used data in sector 0 and more frequently used means of a TLB. However, managing the entire FUJITSU Sci. Tech. J., Vol. 48, No. 3 (July 2012) 297 J. Moroo et al.: Operating System for the K computer data area with one-sized large pages may reduce the job. the memory utilization efficiency because of the This daemon operation causes interruptions unused memory area in a large page. To address (system noise) of a parallel program. In this problem, the OS of the K computer allows computation with tens of thousands of nodes, the page size to be selected from the options of interruptions may increase in proportion to 4 Mbytes, 32 Mbytes and 256 Mbytes. the number of nodes, leading to a significant For example, to run an application that performance degradation. There are two ways to uses a stack larger than the area for static data solve this problem: including global variables, a large page can be 1) Dedicate a specific core of a node composed allocated to the stack and a small page

Operating System for the K Computer

Chapter 1: Introduction What Is an Operating System?

The Linux Kernel Module Programming Guide

Balanced Multithreading: Increasing Throughput Via a Low Cost Multithreading Hierarchy

Amdahl's Law Threading, Openmp

Parallel Programming

Parallel Generation of Image Layers Constructed by Edge Detection

CS650 Computer Architecture Lecture 10 Introduction to Multiprocessors

Absolute BSD—The Ultimate Guide to Freebsd Table of Contents Absolute BSD—The Ultimate Guide to Freebsd

Real-Time Performance During CUDA™ a Demonstration and Analysis of Redhawk™ CUDA RT Optimizations

Unit: 4 Processes and Threads in Distributed Systems

Gpu Concurrency

Personal-Computer Systems • Parallel Systems • Distributed Systems • Real -Time Systems