Quick viewing(Text Mode)

Operating System for the K Computer

Operating System for the K Computer

Operating for the tag">K

 Jun Moroo  Masahiko Yamada  Takeharu Kato

For the to achieve the world’s highest performance, has worked on the following three performance improvements in the development of the (OS). First, to bring out the maximum hardware performance of our original CPU and interconnect, we provided a mechanism of controlling hardware extensions directly from applications. As the second improvement, we have introduced the synchronization function that minimizes the synchronization of parallel programs resulting from system interruptions by coordinating runtime and system runtime between multiple nodes. Third, multiple size support that allows use of than one page size has been achieved for improved memory access performance and memory utilization efficiency. This paper also describes the performance improvement functions, usability and robustness of the OS developed.

1. Introduction performance of the is essential. are used in large-scale This paper gives a description of the OS of simulation computations in a variety of science the K computernote)i that is capable of conducting and fields. For weather forecasts, for large-scale simulations. example, the prediction area can be divided the municipality level or even further at the station 2. configuration of the level to calculate values such as the temperature, K computer wind direction, and wind speed for each sub- Generally, supercomputers perform high- area, enables highly accurate predictions speed computations by: to be made. Aerodynamic simulations on aircraft 1) Dividing huge computations allow to estimate how an aircraft will 2) Operating many (compute nodes) without having to build a model or actual concurrently in a coordinated manner aircraft. In the development of new drugs, 3) Transferring computation results at high substances that are candidates for a medicine speeds between compute nodes by using can be extracted out of a vast number of types a set of libraries and combinations to narrow them down to those called Interface (MPI) with a curative effect. The OS processes application startup and IO Carrying out such large-scale simulations note)i “K computer” is the English name involves enormous amounts of memory and that has been using for the compute nodes and a huge number of files. supercomputer of this project since July Accordingly, an operating system (OS) capable 2010. “K” comes from the Japanese word “Kei,” which means ten peta or 10 to the of bringing out the maximum hardware 16th power.

FUJITSU Sci. Tech. ., . 48, No. 3, pp. 295–301 (July 2012) 295 J. Moroo et al.: Operating System for the K computer

processing requests and takes charge of system kernel and has additional drivers for using the processing including time of day control. For hardware of the K computer so that it can be the OS of the K computer, we have maximized used from the in the upper level. the hardware and software performance by We have worked on developing the OS to improving the OS kernel and libraries. We install on the K computer, aiming for targets worked on developing the OS of the K computer including improved usability, improved with a of achieving 10 PFLOPS, a performance and improved robustness as shown performance level ten times higher than the in Table 1. The following sections present the system at the time of its initial development. respective targets. The software of the K computer is composed of the OS and basic middleware 3. Improved usability (effective (operations management software, language utilization of assets) system) (Figure 1). The OS implements the Some of the existing supercomputers use architecture-dependent portion of the special OSes, on which software that has been

Operations management software Language system

Middleware Job management Debuggers tuning tools

System control/management Language libraries MPI libraries

OS Commands, libraries

Process management Network system OS time management Device drivers

Architecture-dependent portion Sector Barriers Tofu IO

Tofu CPU (SPARC64 VIIIfx) Sector cache Barriers IO devices interconnect Hardware

Figure 1 Software configuration of the K computer.

Table 1 Features of the K computer OS. Feature Realization method Linux Improved usability Effective utilization of user assets POSIX API SIMD instruction support Hardware barrier Direct control of hardware extensions Sector cache Improved Extension of device drivers for performance hardware control registers

Improved Synchronization scheduling with rsadc Improved memory performance and efficiency Large page support Security User access protection Improved robustness RAS enhancement Fallback for failure

296 FUJITSU Sci. Tech. J., Vol. 48, No. 3 (July 2012) J. Moroo et al.: Operating System for the K computer

developed by the user cannot be used. For the in sector 1. This prevents frequently used OS of the K computer, we have adopted Linux to data from being expelled from the cache, and allow continued use of the user’s assets. thus improves the performance of Similarly to general OSes, the OS of the K applications. The amount of frequently used computer is equipped with management, data may vary depending on the functions in memory management, IO management (device applications. Accordingly, the ratio of the sector driver), and network functions. cache used from user applications has been Linux is used in more than 91% of the TOP500 made changeable during program execution to ranked supercomputers and we have adopted allow the utilization efficiency of the cache to be it in view of making user applications portable. optimized. Users can use POSIX API, a application The performance improvement effect of programming interface (API), on the K computer. cache control cannot be sufficiently achieved In addition, the OS of the K computer has if overhead is generated due to the issuance API (, command) extensions for using of a when a is called extended hardware, which allows easy use of along with sector cache operation. For that hardware functions through the and reason, device drivers for the K computer have MPI library functions. been designed and implemented to allow user These contrivances enable the OS of applications to directly control the registers that the K computer to port general open source control the sector cache. This has successfully software (application programs, tools, etc.) by reduced the sector cache control time from a few recompilation without having to change the microseconds to a few nanoseconds. source program. 4.2 Improved memory access 4. Improved performance performance 4.1 Direct control of hardware extensions The OS of the K computer is equipped SPARC64 VIIIfx is a SPARC64 VII-based with large page support as a function to provide CPU equipped with expanded registers, extended both access performance and high SIMD instructions, hardware barriers between memory utilization efficiency. cores and sector cache for . For the OS of Generally, the management the K computer, we have prepared a mechanism unit provided for Linux for 64-bit SPARC with additional device drivers that allow these manages the memory used by programs in hardware extensions to be controlled directly 8 Kbyte page units. from applications. The drivers map registers A cache (TLB: translation lookaside that control barrier synchronization and sector buffer) is provided for speeding up the address cache into the memory space of applications, by translation performed by the hardware but the which high-speed control is achieved without number of entries is limited. With a page size involving system call overhead. of 8 Kbytes, use of memory of a few Mbytes or The sector cache is intended to be used to larger causes the process to exceed the capacity divide the cache in the CPU into two virtual of the TLB and overhead resulting from OS regions to facilitate caching of the data that are address translation is generated. repeatedly used by applications. If the page size to be managed with one Specifically, the complier generates memory entry is increased, high-speed access to memory access instructions to cache frequently of a few Gbytes or larger can be achieved by used data in sector 0 and more frequently used means of a TLB. However, managing the entire

FUJITSU Sci. Tech. J., Vol. 48, No. 3 (July 2012) 297 J. Moroo et al.: Operating System for the K computer

data area with one-sized large may reduce the job. the memory utilization efficiency because of the This operation causes interruptions unused memory area in a large page. To address (system noise) of a parallel program. In this problem, the OS of the K computer allows computation with tens of thousands of nodes, the page size to be selected from the options of interruptions may increase in proportion to 4 Mbytes, 32 Mbytes and 256 Mbytes. the number of nodes, leading to a significant For example, to run an application that performance degradation. There are two ways to uses a stack larger than the area for static data solve this problem: including global variables, a large page can be 1) Dedicate a specific core of a node composed allocated to the stack and a page to the of multiple CPU cores to the OS to run the data area. In this way, both improved memory daemon for eliminating interruptions in the utilization efficiency and improved performance job on each node (spatial division) thanks to the reduction of TLB misses can be 2) Reduce interruptions in synchronization/ achieved (Figure 2). communication of a parallel program to a certain level independently of the number of 4.3 Improved scalability nodes by synchronized running of daemons We have developed a synchronization between compute nodes (synchronization scheduler and sampling mechanism scheduler method) (temporal division) for the purpose of improving the performance of With the method described in 1), when one parallel jobs across multiple nodes (scalability of the eight cores of the CPU of the K computer improvement). is allocated to the system and seven cores to the job, the execution efficiency is limited to 87.5% 4.3.1 Synchronization scheduler (seven-eighths). Accordingly, we have developed A single computation job that runs on the K the synchronization scheduler method described computer system is a parallel program deployed in 2) for the K computer. on multiple nodes and, every time a certain With the synchronization scheduler of the computation process is completed, mutual K computer, the OS uses the barrier function synchronization and communication take place of the Tofu interconnect to achieve inter-node between nodes connected via interconnects. synchronization. The synchronization scheduler For system operation, the OS and operations enhances the process scheduler function of software daemons (programs that run in the Linux to perform synchronization at intervals background for system management) need to of 100 ms, for example, and run the daemons in run, and they run asynchronously regardless of 1 ms out of the 100 ms. The parallel program can

• Example of process that uses 32 Mbytes of static data

Page with 256 Mbytes of static data Page with 32 Mbytes of static data Static data area 32 Mbytes used 32 Mbytes used Static data area (32 Mbytes) (256 Mbytes) 224 Mbytes of unused No unused area generated area generated Memory utilization efficiency increased

Stack area Stack area (256 Mbytes) (256 Mbytes)

Figure 2 Effect of multiple page size support in process that uses 32 Mbytes of static data.

298 FUJITSU Sci. Tech. J., Vol. 48, No. 3 (July 2012) J. Moroo et al.: Operating System for the K computer

run 99% of the time in synchronization between run periodically in the system to be minimized. all nodes without interruptions, which allows for To monitor to ensure normal operation of a performance improvement in proportion to the the system and analyze performance bottlenecks, number of nodes (Figure 3). the OS samples statistics including CPU and memory usage. Generally, for OS statistics, 4.3.2 Statistics sampling function the sadc command is run via , a function The K computer reduces interruptions intended for periodically executing system themselves by using the synchronization processes, to kernel data and to the scheduler to lower the effect of daemons to statistics file. This method does not cause major jobs and optimizing daemon processes. As an system noise with small but with the example, this section describes the improvement K computer, which is an enormous system, this of the statistics sampling function. statistics sampling process may result in system When a user application runs on a compute noise. node, daemon processes that run periodically For the K computer, remote sadc function in the system can cause interruptions (system (rsadc), which collects compute node performance noise) to the user application and result in statistics in IO nodes, has been developed variation in execution performance of the user (Figure 4). We have aimed to reduce costs application (the execution performance may vary and improve reliability with the K computer depending on the execution timing of the user by aggregating IO devices onto IO nodes in application). each rack. We have reduced the number of For this reason, reducing system noise on a interruptions by offloading to these IO nodes compute node requires the daemon processes that the IO processes of performance statistics of the

Actual computation job Ideal computation job

Node 1 Node 2 ... Node N Node 1 Node 2 ... Node N

Time Time

: System process : Synchronization : Job execution section : Synchronization : Synchronization end execution section wait section

Figure 3 Operation of synchronization scheduler.

FUJITSU Sci. Tech. J., Vol. 48, No. 3 (July 2012) 299 J. Moroo et al.: Operating System for the K computer

sadc used rsadc used Compute node IO node Compute node IO node

sadc daemon nfs ksadc daemon rsadc

nfs client TCP/IP TCP/IP IO IO Tofu Tofu Tofu Tofu

RDMA

Figure 4 Schematic diagram of rsadc. many compute nodes. system must be secure so that it is possible to The ksadc daemon of compute nodes files within user groups or prevent other maintains the performance statistics of the groups from viewing the files. individual nodes in the memory and carries out The K computer has been given high IO processes by means of rsadc on IO nodes. The robustness by making use of hardware command rsadc uses the remote direct memory redundancy, RAS functions provided for access (RDMA) function of the Tofu interconnect Linux including multipath driver and security to transfer the performance statistics of the function and adding to Linux the RAS functions individual compute nodes to IO node memory that have been enhanced in Fujitsu’s existing without daemon processes on compute nodes, supercomputers and servers. which can be stored on disks. The following subsections describe examples For the K computer, we have developed a of the enhanced RAS functions and security function for a system composed of many racks functions. so that it can collect log files required for billing and system management from IO nodes of 5.1 RAS enhancement the many racks into one control node without The OS of the K computer has inherited accessing compute nodes. This has enabled the the existing server of Fujitsu to administrator to gain an understanding of the enhance the functions of intermittent memory conditions of the entire system without causing error recovery and identification and notification interruptions in parallel jobs. of hardware failure points. To address memory errors caused by alpha 5. Improved robustness rays and such like, the OS of the K computer The K computer, which is composed of a cooperates with the memory patrol function of large number of hardware components, is a the hardware to perform data correction, thereby system required by various users to perform making the system highly reliable. lengthy computations and it needs to be highly The hardware performs memory read robust. In relation to hardware failures, for (patrol) at intervals specified by the OS. By using example, it is important to have reliability, the error correcting code (ECC) mechanism, availability and serviceability (RAS) functions, the OS writes data to the memory when any which allow continued operation even if one single-bit error is detected in the memory data, component fails and early detection of the point by which error correction including ECC is of failure for replacement. In addition, the performed. Performing this error correction

300 FUJITSU Sci. Tech. J., Vol. 48, No. 3 (July 2012) J. Moroo et al.: Operating System for the K computer

periodically prevents correctable single-bit errors used in UNIX-based OSes. from developing into uncorrectable double-bit The OS supports the following security errors, which improves reliability. functions in relation to authentication by general Once any failed page is detected, the OS user management. marks it to prevent it being allocated memory • encryption when any subsequent memory allocation request • Change of fi le and access rights is issued, thereby retiring the page. • Change of owners of fi les and directories When any memory error is detected, the OS • Change of group ownership of fi les and specifi es the slot containing that memory and directories notifi es the administrator of it so as to reduce For fi les and directories, security the time from stopping the node to replacing management is provided based on the controlled the component. The RAS functions have been access protection profi le (CAPP), which enhanced to allow precise identifi cation of controls access by “read,” “write” and “execute” possible points of failure based on error events permissions for user, group and other. detected in relation to CPU and IO devices (PCI channel, Gbit Ethernet adapter, Fibre Channel 6. Conclusion adapter, etc.) in addition to memory. This paper has described the features of the OS of the K computer. The development of an 5.2 Security functions OS for a globally unparalleled large-scale system The K computer system is used was a challenging , and we needed to achieve simultaneously by many users and it is essential both ultimate performance and usability. We that it has security functions. However, security intend to continuously work on developing the functions do not sense if they affect the OS to provide the foundation for more easily computing performance. conducting large-scale simulations on the K The security functions of the K computer computer and advance various fi elds in have been achieved by combining the user science and industry. management and fi le system functions normally

Jun Moroo Takeharu Kato Fujitsu Ltd. Fujitsu Ltd. Mr. Moroo is currently engaged Mr. Kato is currently engaged in in for - software development for next- generation HPC. generation HPC.

Masahiko Yamada Fujitsu Ltd. Mr. Yamada is currently engaged in software development for next- generation HPC.

FUJITSU Sci. Tech. J., Vol. 48, No. 3 (July 2012) 301