High-Performance and Highly Reliable File System for the K Computer
Kenichiro Sakai   Shinji Sumimoto   Motoyoshi Kurokawa

FUJITSU Sci. Tech. J., Vol. 48, No. 3, pp. 302–309 (July 2012)

RIKEN and Fujitsu have been developing the world's fastest supercomputer, the K computer. In addition to over 80 000 compute nodes, the K computer has several tens of petabytes of storage capacity and over one terabyte per second of I/O bandwidth, making it the largest and fastest storage system in the world. In order to take advantage of this huge storage system and achieve high scalability and high stability, we developed the Fujitsu Exabyte File System (FEFS), a clustered distributed file system. This paper outlines the K computer's file system and introduces measures taken in FEFS to address key issues in a large-scale system.

1. Introduction

The dramatic jumps in supercomputer performance in recent years led to the achievement of 10 peta floating point operations per second (PFLOPS) in 2011. The increasing number of compute nodes and cores and the growing amount of on-board memory mean that file systems must have greater capacity and total throughput. File-system capacity and performance are expected to reach the 100-petabyte (PB) class and 1-terabyte/s (TB/s) class, respectively, as both are increasing at a nearly tenfold rate every year. For this reason, mainstream file systems have been changing from a single-server type to a clustered type.

We have developed a clustered distributed file system called the Fujitsu Exabyte File System (FEFS) with the aim of achieving a file system whose capacity and performance can handle the computing performance of the K computer,note)i which is currently the fastest supercomputer in the world.

This paper outlines the K computer's file system and introduces measures taken in FEFS to address key issues in a large-scale system.

note)i "K computer" is the English name that RIKEN has been using for the supercomputer of this project since July 2010. "K" comes from the Japanese word "Kei," which means ten peta or 10 to the 16th power.

2. File system for the K computer

To achieve a level of performance and stable job execution appropriate for the world's fastest supercomputer, we adopted a two-layer model for the file system of the K computer. This model consists of a local file system used as a high-speed, temporary-storage area for jobs only and a global file system used as a large-capacity shared-storage area for storing user files (Figure 1). In this two-layer model, jobs are executed through data transfers between these two file systems using a file-staging control function. The following describes the respective roles of the local file system, global file system, and file-staging process.

[Figure 1. Two-layer file system model: compute nodes reach the local file system over the Tofu interconnect via I/O nodes (OSSs); login and job management nodes reach the global file system over the data transfer and management network; a staging function moves files between the two layers. MDS: metadata server (active/standby); OSS: object storage server.]

1) Local file system

This is a high-speed, temporary storage area dedicated to job execution for extracting maximum file-I/O performance for applications executed as batch jobs. The local file system transfers input/output files from and to the global file system through the file-staging function and temporarily holds job files that are either under execution or waiting to be executed.

The file server used to access data blocks connects to a compute node through the Tofu interconnect using an I/O-dedicated node (I/O node) within the compute rack through remote direct memory access (RDMA) communications. As a result, low-latency, high-throughput file data transfer is achieved.

2) Global file system

This is a large-capacity, shared storage area external to the compute racks for storing user files consisting of job input/output data and other content. In addition to providing file access through the login node, the global file system enables direct file access from a compute node when executing an interactive job so that a user can perform program debugging and tuning while checking the job output in real time.

A compute node connects to a group of file servers in the global file system through an InfiniBand link mounted on the I/O-dedicated node. In short, an I/O node relays data between the Tofu interconnect and InfiniBand, while a compute node accesses file servers via that I/O node.

3) File staging

File transfer between the local and global file systems is performed automatically by the system through the file-staging function, which operates in conjunction with job operation software. Before a job begins, this function transfers the input files for that job from the global file system to the local file system (stage-in), and after the job completes, it transfers the output files from the local file system to the global file system (stage-out). The user specifies the stage-in and stage-out files in a job script.

3. FEFS: Ultra-large-scale cluster file system

We developed FEFS to provide a local file system and a global file system suitable for the top-class performance of the K computer. The following objectives were established in developing FEFS:
• World's fastest I/O performance and high-performance Message Passing Interface (MPI)-IO
• Stable job run times achieved by eliminating interruptions
• World's largest file-system capacity
• Greater scalability in performance and capacity by adding hardware
• High reliability (service continuity and data preservation)
• Ease of use (sharing by many users)
• Fair share (fair use by many users)

To achieve these objectives, we developed FEFS on the basis of the Lustre open-source file system, which has become a worldwide industry standard, while extending certain specifications and functions as needed. As shown in Table 1, FEFS extends specifications like maximum file system size and maximum file size to the 8-exabyte (EB) class relative to existing Lustre specifications. File system size can be extended even during system operation by adding more servers and storage devices.

Table 1
Lustre and FEFS comparison.

                                   Function                       Lustre      FEFS
System specifications              File system size               64 PB       8 EB
(max. values)                      File size                      320 TB      8 EB
                                   No. of files                   4 G         8 E
                                   OST volume size                16 TB       1 PB
                                   No. of stripes                 160         20 000
                                   No. of ACL entries             32          8191
Scalability                        No. of OSSs                    1020        20 000
(max. values)                      No. of OSTs                    8150        20 000
                                   No. of clients                 128 units   1 000 000 units
                                   Block size
                                   (backend file system)          4 KB        up to 512 KB

The following sections describe the FEFS measures listed below for dealing with an ultra-large-scale supercomputer system.
• Elimination of I/O contention by using communication-bus and disk splitting
• Continuous availability by using hardware duplication and failover
• Hierarchical node monitoring and automatic switching
• QoS-policy selection by type of operation

4. Elimination of I/O contention by using communication-bus and disk splitting

As a clustered file system, FEFS bundles together a large number of file servers and improves parallel throughput performance in proportion to the number of file servers. However, if access should become concentrated on specific file servers, communications can become congested and disk-access contention can arise, resulting in a drop in I/O performance and a variation in job run times between compute nodes. It is therefore essential that file I/O contention be completely eliminated to ensure stable job execution.

In FEFS, file I/O contention is eliminated by separating file I/O access into two stages, namely, in units of jobs and in units of compute nodes (Figure 2). At the job level, each job is assigned a separate range of I/O nodes corresponding to file-storage destinations to eliminate file I/O contention on servers, on the network, and on disks. At the compute-node level within a job, the transmitting and receiving of file data is performed via I/O nodes constituting a minimum number of Tofu-communication hops, thereby minimizing file I/O contention among compute nodes.

5. Continuous availability by using hardware duplication and failover

In a large-scale system, the failure of a single hardware component must not prevent file-system service and system operations from continuing. However, when configuring FEFS with more than several hundred file servers and storage devices, there are bound to be times when hardware such as network adaptors and servers are in a fault or maintenance state. To ensure that operations across the entire system continue even under such conditions, it is essential that faults be automatically detected and detours taken around faulty locations so that file-system service can continue unimpeded.

For this reason, FEFS duplicates hardware and uses a software-based control process to initiate failover processing at the time of a hardware fault. This processing switches the server and I/O communication paths and restores the service in progress to its normal state.
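As a concrete illustration of the stage-in/stage-out flow performed by the file-staging function, the following is a minimal Python sketch. The function names and directory layout are hypothetical; in the K computer the staging is driven by the job-operation software and job-script directives, not by user code like this.

```python
import shutil
from pathlib import Path

def stage_in(global_fs: Path, local_fs: Path, inputs: list[str]) -> None:
    """Copy a job's input files from the global to the local file system."""
    for name in inputs:
        shutil.copy2(global_fs / name, local_fs / name)

def stage_out(local_fs: Path, global_fs: Path, outputs: list[str]) -> None:
    """Copy a job's output files from the local file system back to the global one."""
    for name in outputs:
        shutil.copy2(local_fs / name, global_fs / name)

def run_staged_job(job, global_fs: Path, local_fs: Path,
                   inputs: list[str], outputs: list[str]) -> None:
    """Stage in, run the job against the fast local file system, then stage out."""
    stage_in(global_fs, local_fs, inputs)
    job(local_fs)  # all job I/O hits only the local file system
    stage_out(local_fs, global_fs, outputs)
```

The point of the two-layer design is visible in the wrapper: the job itself touches only the local file system, so its I/O performance is decoupled from load on the shared global file system.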
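The two-stage contention elimination described in Section 4 can be sketched in Python. This is a toy model: the even partitioning policy, the node identifiers, and the hop-count function are assumptions for illustration only, not FEFS's actual allocation algorithm or the real Tofu routing.

```python
def partition_io_nodes(jobs, io_nodes):
    """Job level: give each job a disjoint range of I/O nodes so that jobs
    never contend for the same servers, network paths, or disks."""
    k, r = divmod(len(io_nodes), len(jobs))
    ranges, start = {}, 0
    for i, job in enumerate(jobs):
        end = start + k + (1 if i < r else 0)
        ranges[job] = io_nodes[start:end]
        start = end
    return ranges

def assign_io_nodes(compute_nodes, io_nodes, hops):
    """Compute-node level: route each compute node's file data through the
    I/O node reachable in the fewest communication hops."""
    return {c: min(io_nodes, key=lambda io: hops(c, io)) for c in compute_nodes}
```

The first function removes inter-job contention (disjoint server/disk ranges); the second minimizes intra-job contention by keeping each compute node's traffic on its shortest path to an I/O node.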
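The failover behavior of Section 5 — detect a hardware fault, then switch service to the duplicated standby hardware — can be sketched as a heartbeat-monitored active/standby pair. The class, the miss threshold, and the server names below are illustrative assumptions, not FEFS's actual software-based control process.

```python
class FailoverPair:
    """Active/standby server pair with heartbeat-driven failover (sketch)."""

    def __init__(self, active: str, standby: str, threshold: int = 3):
        self.active, self.standby = active, standby
        self.threshold = threshold   # consecutive missed heartbeats before failover
        self.missed = 0

    def heartbeat(self, ok: bool) -> str:
        """Record one heartbeat result and return the currently serving node.

        After `threshold` consecutive misses the roles are swapped, modeling
        the switch of server and I/O communication paths to the standby."""
        if ok:
            self.missed = 0
        else:
            self.missed += 1
            if self.missed >= self.threshold:
                self.active, self.standby = self.standby, self.active
                self.missed = 0
        return self.active
```

Requiring several consecutive misses before switching is a common design choice to avoid spurious failover on a single dropped heartbeat; the real system must additionally fence the failed server and replay in-progress operations so that service is restored to its normal state.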