
Efficient Object Storage Journaling in a Distributed Parallel File System

Sarp Oral, Feiyi Wang, David Dillow, Galen Shipman, Ross Miller
National Center for Computational Sciences, Oak Ridge National Laboratory
{oralhs,fwang2,gshipman,dillowda,rgmiller}@ornl.gov

Oleg Drokin
Center of Excellence at ORNL, Sun Microsystems Inc.
[email protected]

Abstract

Journaling is a widely used technique to increase file system robustness against metadata and/or data corruption. While the overhead of journaling can be masked by the page cache for small-scale, local file systems, we found that Lustre's use of journaling for the object store significantly impacted the overall performance of our large-scale, center-wide parallel file system. By requiring that each write request wait for a journal transaction to commit, Lustre introduced serialization to the client request stream and imposed additional latency due to disk head movement (seeks) for each request.

In this paper, we present the challenges we faced while deploying a very large scale production storage system. Our work provides a head-to-head comparison of two significantly different approaches to increasing the overall efficiency of the Lustre file system. First, we present a hardware solution using external journaling devices to eliminate the latencies incurred by the extra disk head seeks due to journaling. Second, we introduce a software-based optimization to remove the synchronous commit for each write request, side-stepping additional latency and amortizing the journal seeks across a much larger number of requests.

Both solutions have been implemented and experimentally tested on our Spider storage system, a very large scale Lustre deployment. Our tests show that both methods considerably improve the write performance, in some cases by up to 93%. Testing with a real-world scientific application showed a 37% decrease in the number of journal updates, each with an associated seek, which translated into an average I/O bandwidth improvement of 56.3%.

1 Introduction

Large-scale HPC systems target a balance of file I/O performance with computational capability. Traditionally, the standard was 2 bytes per second of I/O bandwidth for each 1,000 FLOPS of computational capacity [18]. Maintaining that balance for a 1 Petaflops (PFLOPS) system would require the deployment of a storage subsystem capable of delivering 2 TB/sec of I/O bandwidth at a minimum. Building such a system with current or near-term storage technology would require on the order of 100,000 magnetic disks. This would be cost prohibitive not only due to the raw material costs of the disks themselves, but also due to the magnitude of the design, installation, and ongoing management and electrical costs for the entire system, including the RAID controllers, network links, and switches. At this scale, reliability metrics for each component would virtually guarantee that such a system would continuously operate in a degraded mode due to ongoing simultaneous reconstruction operations [22].

The National Center for Computational Sciences (NCCS) at Oak Ridge National Laboratory (ORNL) hosts the world's fastest supercomputer, Jaguar [8], with over 300 TB of total system memory. Rather than rely on a traditional I/O performance metric, such as 2 bytes/sec of I/O throughput for each 1,000 FLOPS of computational capacity, a survey of application requirements was conducted prior to the design of the parallel I/O environment for Jaguar. This resulted in a requirement of delivered bandwidth of over 166 GB/sec, based on the ability to checkpoint 20% of total system memory, once per hour, using no more than 10% of total compute time. Based on application I/O profiles and available resources, the Jaguar upgrade targeted 240 GB/s of storage bandwidth. Achieving this target on Jaguar has required careful attention to detail and optimization of the system at multiple levels, including the storage hardware, network topology, OS, I/O middleware, and application I/O architecture.
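As a rough check, the 166 GB/sec figure follows directly from the stated parameters (20% of roughly 300 TB of system memory, written within 10% of one hour):

\[
\frac{0.20 \times 300\,\mathrm{TB}}{0.10 \times 3600\,\mathrm{s}} = \frac{60\,\mathrm{TB}}{360\,\mathrm{s}} \approx 167\,\mathrm{GB/s}.
\]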

There are many studies on user-level file system performance of different Cray XT platforms and their respective storage subsystems. These provide important information for scientific application developers and system engineers, such as peak system throughput and the impact of Lustre file striping patterns [33, 1, 32]. However – to the best of our knowledge – there has been no work done to analyze the efficiency of the object storage system's journaling and its impact on overall I/O throughput in a large-scale parallel file system such as Lustre.

Journaling is widely used by modern file systems to increase file system robustness against metadata corruption and to minimize file system recovery times after a system crash. Aside from journaling, there are several other techniques for preventing metadata corruption. Soft updates handle the metadata update problem by guaranteeing that blocks are written to disk in their required order without using synchronous disk I/O [10, 23]. Vendors such as Network Appliance [3] have addressed the issue with a hardware-assisted approach (non-volatile RAM), resulting in performance superior to both journaling and soft updates at the expense of extra hardware. NFS version 3 [20] introduced asynchronous writes to overcome the bottleneck of synchronous writes. The server is permitted to reply to the client before the data is on stable storage, which is similar to our Lustre asynchronous solution. The log-structured file system [17] took a departure from the conventional update-in-place approach by writing modified data and metadata in a log. More recently, ZFS [13] has been coupled with flash-based devices for intent logging, so that synchronous writes are directed to these log devices with very low latency, improving overall performance.

While the overhead of journaling can be masked by using the page cache for local file systems, our experiments show that on a large-scale parallel Lustre file system it can substantially degrade overall performance.

In this paper, we present our experiences and the challenges we faced towards deploying a very large scale production storage system. Our findings suggest that sub-optimal object storage file system journaling performance significantly hurts the overall parallel file system performance. Our work provides a head-to-head comparison of two significantly different approaches to increasing the overall efficiency of the Lustre file system. First, we present a hardware solution using external journaling devices to eliminate the latencies incurred by the extra disk head seeks for the journal traffic. Second, we introduce a software-based optimization to remove the synchronous commit for each write request, side-stepping additional latency and amortizing the journal seeks across a much larger number of requests.

Major contributions of our work include measurements and performance characterization of a very large storage system unique in its scale; the identification and elimination of serial bottlenecks in a large-scale parallel system; and a cost-effective and novel solution to file system journaling overheads in a large scale system.

The remainder of this paper is organized as follows: Section 2 introduces Jaguar and its large-scale parallel I/O subsystem, while Section 3 provides a quick overview of the Lustre parallel file system and presents our initial findings on the performance problems of Lustre file system journaling. Section 4 introduces our hardware solution to the problem and Section 5 presents the software solution. Section 6 summarizes and provides a discussion of the results of our hardware and software solutions and presents results of a real science application using our software-based solution. Section 7 presents our conclusions.

2 System Architecture

Jaguar is the main simulation platform deployed at ORNL. Jaguar entered service in 2005 and has undergone several upgrades and additions since that time. Detailed descriptions and performance evaluations of earlier Jaguar iterations can be found in the literature [1].

2.1 Overview of Jaguar

In late 2008, Jaguar was expanded with the addition of a 1.4 PFLOPS Cray XT5 in addition to the existing Cray XT4 segment, resulting in a system with over 181,000 processing cores connected internally via Cray's SeaStar2+ [4] network. (A more recent Jaguar XT5 upgrade swapped the quad-core AMD Opteron 2356 (Barcelona) CPUs with hex-core AMD Opteron 2435 (Istanbul) CPUs, increasing the installed peak performance of Jaguar XT5 to 2.33 PFLOPS and the total number of cores to 224,256.) The XT4 and XT5 segments of Jaguar are connected via a DDR InfiniBand network that also provides a link to our center-wide file system, Spider. More information about the Cray XT5 architecture and Jaguar can be found in [5, 19].

Jaguar has 200 Cray XT5 cabinets. Each cabinet has 24 compute blades, each blade has 4 compute nodes, and each compute node has two AMD Opteron 2356 (Barcelona) quad-core processors. Figure 1 shows the high-level Cray XT5 node architecture. The configuration tested has 16 GB of DDR2-800 MHz memory per compute node (2 GB per core), for a total of 300 TB of system memory. Each processor is linked with dual HyperTransport connections. The HyperTransport interface enables direct high-bandwidth connections between the processor, memory, and the SeaStar2+ chip. The result is a dual-socket, eight-core node with a peak processing performance of 73.6 GFLOPS.
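The quoted node peak is consistent with the stated clock rate, assuming the four double-precision floating-point operations per core per cycle typical of this processor generation (the per-cycle rate is not stated above):

\[
2\,\text{sockets} \times 4\,\text{cores} \times 2.3\,\mathrm{GHz} \times 4\,\mathrm{FLOP/cycle} = 73.6\,\mathrm{GFLOPS}.
\]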

The XT5 segment has 214 service and I/O nodes, of which 192 provide connectivity to the Spider center-wide file system with 240 GB/s of demonstrated file system bandwidth over the scalable I/O network (SION). SION is deployed as a multi-stage InfiniBand network [25], and provides a backplane for the integration of multiple NCCS systems such as Jaguar (the simulation and analysis platform), Spider (the NCCS-wide Lustre file system), Smoky (the development platform), and various other compute resources. SION allows capabilities such as streaming data from the simulation platforms to the visualization center at extremely high rates.

Figure 1. Cray XT5 node (courtesy of Cray)

2.2 Spider I/O subsystem

The Spider I/O subsystem consists of Data Direct Networks' (DDN) S2A9900 storage devices interconnected via SION. A pair of S2A9900 RAID controllers is called a couplet. Each controller in a couplet works as an active-active fail-over unit. There are 48 DDN S2A9900 couplets [6] in the Spider I/O subsystem. Each couplet is configured with five ultra-high density 4U, 60-bay disk drive enclosures (56 drives populated), giving a total of 280 1 TB hard drives per S2A9900. The system as a whole has 13,440 TB, or over 10.7 PB, of formatted capacity. Fig. 2 illustrates the internal architecture of a DDN S2A9900 couplet. Two parity drives are dedicated in the case of an 8+2 parity group, or RAID 6. A parity group is also known as a tier.

Figure 2. Architecture of a S2A9900 couplet

Spider, the center-wide Lustre [28] file system, is built upon this I/O subsystem. Spider is the world's fastest and largest production Lustre file system and is one of the world's largest POSIX-compliant file systems. It is designed to work with both Jaguar and other computing resources such as the visualization and end-to-end analysis clusters. Spider has 192 Dell PowerEdge 1950 servers [7] configured as Lustre servers presenting a global file system name space. Each server has 16 GB of memory and dual-socket, quad-core Intel Xeon E5410 CPUs running at 2.3 GHz. Each server is connected to SION and the DDN arrays via independent 4x DDR InfiniBand links. In aggregate, Spider delivers up to 240 GB/s of file system level throughput and provides 10.7 PB of formatted disk capacity to its users. Fig. 3 shows the overall Spider architecture. More details on Spider can be found in [26].

Figure 3. Overall Spider architecture

3 Lustre and file system journaling

Lustre is an open-source distributed parallel file system developed and maintained by Sun Microsystems and licensed under the GNU General Public License (GPL). Due to the extremely scalable architecture of Lustre, deployments are popular in both scientific supercomputing and industry. As of June 2009, 70% of the Top 10 systems, 70% of the Top 20, and 62% of the Top 50 fastest supercomputer systems in the world used Lustre for high-performance scratch space [9], including Jaguar. (As of November 2009, 60% of the Top 10 fastest systems in the world used the Lustre file system for high-performance scratch space, including Jaguar.)

3.1 Lustre parallel file system

Lustre is an object-based file system and is composed of three components: metadata storage, object storage, and clients. There is a single metadata target (MDT) per file system. A metadata server (MDS) is responsible for managing one or more MDTs. Each MDT stores file metadata, such as file names, directory structures, and access permissions. Each object storage server (OSS) manages one or more object storage targets (OSTs), and OSTs store file data objects.

For file data read/write access, the MDS is not on the critical path, as clients send requests directly to the OSSes. Lustre uses block devices for file data and metadata storage, and each block device can only be managed by one Lustre service (such as an MDT or an OST). The total data capacity of a Lustre file system is the sum of all individual OST capacities. Lustre clients concurrently access and use data through the standard POSIX I/O system calls. More details on the inner workings of Lustre can be found in [31].

Currently, Lustre version 1.6 employs a heavily patched and enhanced version of the Linux ext3 file system, known as ldiskfs, as the back-end local file system for the MDT and all OSTs. Among the enhancements, improvements over the regular ext3 file system journaling are of particular interest for this paper and will be covered in the next sections.

3.2 File system journaling in Lustre

A journaling file system, such as ext3, keeps a log of metadata and/or file data updates and changes so that in case of a system crash, file system consistency can be restored quickly and easily [30]. The file system can journal only the metadata updates or both metadata and data updates, depending on the implementation. The design choice is to balance file system consistency requirements against performance penalties due to extra journaling write operations and delays. In ext3, there are three different modes of journaling: write-back mode, ordered mode, and data journaling mode. In write-back mode, updated metadata blocks are written to the journal device while file data blocks are written directly to the block device. When the transaction is committed, journaled metadata blocks are flushed to the block device without any ordering between the two events. Write-back mode thus provides metadata consistency but does not provide any file data consistency. In ordered mode, file data is guaranteed to be written to its fixed location on disk before the metadata updates are committed to the journal. This ordering protects the metadata and prevents stale data from appearing in a file in the event of a crash. Data journaling mode journals both the metadata and the file data. More details on ext3 journaling modes and their performance characteristics can be found in [21].

Figure 4. Flow diagram for the ordered mode journaling.

Although in the latest Linux kernels the default journaling mode for the ext3 file system is a build-time kernel configuration switch (between ordered mode and write-back mode), ordered mode is the default journaling mode for the ldiskfs file system used as the object store in Lustre.

Journaling in ext3 is organized such that at any given time there are two transactions in memory (not yet written to the journaling device): the currently running transaction and the currently closed transaction (which is being committed to the disk). The currently running transaction is open, is accepting new threads to join in, and has all its data still in memory. The currently closed transaction is not accepting any new threads to join in and has started flushing its updated metadata blocks from memory to the journaling device. After the flush operation is complete and all transactions are on stable storage, the transaction state is changed to "committed." The currently running transaction cannot be closed and committed until the closed transaction fully commits to the journaling device, which for slow disk subsystems can be a point of serialization.
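The constraint described above can be condensed into a small sketch. This is a toy model, not the kernel's JBD code, and all names are illustrative: at any time one transaction accepts new updates while at most one other is being committed, and the running transaction cannot close until that commit finishes.

    class Transaction:
        def __init__(self, tid):
            self.tid = tid
            self.state = "RUNNING"    # RUNNING -> CLOSED -> COMMITTED
            self.updates = []

    class ToyJournal:
        """Toy model of the two-transaction pipeline described above."""

        def __init__(self):
            self.running = Transaction(1)   # open, accepting new updates
            self.closed = None              # being flushed to the journal device

        def add_update(self, blocks):
            # A new handle can only join the currently running transaction.
            self.running.updates.append(blocks)

        def close_running(self):
            # The running transaction cannot be closed while the previously
            # closed transaction is still committing: the serialization point.
            if self.closed is not None and self.closed.state != "COMMITTED":
                raise RuntimeError("previous transaction still committing")
            self.closed = self.running
            self.closed.state = "CLOSED"
            self.running = Transaction(self.closed.tid + 1)

        def commit_closed(self):
            # Flush the closed transaction's metadata blocks to the journal
            # device and write the commit record; only then is it COMMITTED.
            self.closed.state = "COMMITTED"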

Also, even when the disk subsystem is relatively fast, there is another potential point of serialization due to the size of the journaling device. The largest transaction size that can be journaled is limited to 25% of the size of the journal. When a transaction reaches this limit, it is locked and will not accept any new threads or data.

The following list summarizes the steps taken by ldiskfs for a Lustre file update in the default ordered journaling mode. The sequence of events is triggered by a Lustre client sending a write request to an OST.

1. Server gets the destination object id and offset for this write operation.

2. Server allocates the necessary number of pages in memory and fetches the data from the remote client into these pages via a Remote Memory Access (RMA) GET operation.

3. Server opens a transaction on its back-end file system.

4. Server updates file metadata in memory, allocates blocks, and extends the file size.

5. Server closes the transaction handle and obtains a wait handle, but does not commit to the journaling device.

6. Server writes pages with file data to disk synchronously.

7. After the currently running transaction is closed, server flushes updated metadata blocks to the journal device and then marks the transaction as committed.

8. Once the transaction is committed, server can send a reply to the client that the operation was completed successfully, and the client marks the request as completed. Also, the updated metadata blocks, which have been committed to the journal device by now, will be written to disk, without any particular ordering requirement. Fig. 4 shows the generic outline of ordered mode journaling.

There is a minor difference between how this sequence of events happens on an ext3 file system and on the Lustre ldiskfs file system. In an ext3 file system, the order of Steps 6 and 7 is strictly preserved. However, in Lustre ldiskfs, the metadata commit can happen before all data from Step 6 is on disk; Step 7 (flushing of updated metadata blocks to the journaling device) can partially happen before Step 6.

Although Step 5 minimizes the time a transaction is kept open, the above sequence of events may be sub-optimal. For example:

- An extra disk head seek is needed for the journal transaction commit after flushing file data on a different sector of the disk, if the journaling device is located on the same device as the block file data.

- The write I/O operation for a new request is blocked on the currently closed transaction, which is committing in Step 7.

- The new running journal transaction has to wait for the previous transaction to be closed.

- New I/O RPCs are not formed until the completion replies of the previous RPCs have been received by the client, creating yet another point of serialization.

The ldiskfs file system by default performs journaling in ordered mode by first writing the data blocks to disk, followed by the metadata blocks to the journal. The journal is then written to disk and marked as committed. In the worst case, such as appending to a file, this can result in one 16 KB write (on average – for bitmap, block map, inode, and super block data) and another 4 KB write for the journal commit record for every 1 MB write. These extra small writes cause at least two extra disk head seeks. Due to the poor IOPS performance of SATA disks, these additional head seeks and small writes can substantially degrade the aggregate block I/O performance.
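To put these numbers in perspective, a rough estimate is useful; the 10 ms figure below is an assumed nominal seek-plus-rotation penalty for a SATA tier, not a measurement from our system:

\[
16\,\mathrm{KB} + 4\,\mathrm{KB} = 20\,\mathrm{KB} \;(\approx 2\%\ \text{extra data per 1 MB write}),
\]
\[
2 \times 10\,\mathrm{ms} = 20\,\mathrm{ms}\ \text{extra latency} \;\Rightarrow\; \frac{1\,\mathrm{MB}}{20\,\mathrm{ms}} = 50\,\mathrm{MB/s}.
\]

In other words, for a fully serialized request stream it is the two additional head movements, not the roughly 2% of additional data, that bound the per-OST throughput, independent of the disks' streaming bandwidth.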
A potential optimization (and perhaps the most obvious one) for ordered mode to improve the journaling efficiency is to minimize the extra disk head seeks. This can be achieved by either a software or a hardware optimization (or both). Section 4 describes our hardware-based optimization, while Section 5 discusses our software-based optimization.

Using journaling methods other than ordered mode (or no journaling at all) in the ldiskfs file system is not considered in this study, as the OST handler waits for the data writes to hit the disk before returning, and only the metadata is updated in an asynchronous manner. Therefore, write-back mode would not help in our case – Lustre would not use the write-back functionality. Data journaling mode provides increased consistency and satisfies the Lustre requirements, but we would expect it to result in a reduction of performance from our pre-optimization baseline due to doubling the amount of bulk data written. Of course, running without any journaling is a possibility for obtaining better performance, but the cost of possible file system inconsistencies in a production environment is a price that we could ill afford.

To better understand the performance characteristics of each implementation, we performed a series of tests to obtain a baseline performance of our configuration. In order to obtain this baseline on the DDN S2A9900, the XDD benchmark [11] utility was used.

XDD allows multiple clients to exercise a parallel write or read operation synchronously. XDD can be run in sequential or random read or write mode. Our baseline tests focused on aggregate performance for sequential read or write workloads. Performance results using XDD from 4 hosts connected to the DDN via DDR IB are summarized in Table 1. The results presented are a summary of our testing and show performance of sequential read, sequential write, random read, and random write using 1 MB transfers. These tests were run using a single host for the single LUN tests, and 4 hosts each with 7 LUNs for the 28 LUN test. Performance results presented are the best of 5 runs in each configuration.

Table 1. XDD baseline performance

After establishing a baseline of performance using XDD, we examined Lustre-level performance using the IOR benchmark [24]. Testing was conducted using 4 OSSs, each with 7 OSTs, on the DDN S2A9900. Our initial results showed very poor write performance of only 1,398.99 MB/sec using 28 clients, where each client was writing to a different OST. Lustre-level write performance was a mere 24.9% of our baseline performance metric of XDD sequential writes with a 1 MB transfer size. Profiling the I/O stream of the IOR benchmark using the DDN S2A9900 utilities revealed a large number of 4 KB writes in addition to the expected 1 MB writes. These small writes were traced to ldiskfs journal updates.

4 The Hardware Solution

To separate small-sized metadata journal updates from larger (1 MB) block I/O requests and thus enhance our aggregate block I/O performance, we evaluated two hardware-based solutions. Our first option was to use SAS drives as external journal devices. SAS drives are proven to have higher IOP performance compared to SATA drives. For this purpose we used two tiers of SAS drives in a DDN S2A9900, and each tier was split into 14 LUNs. Our second option was to use an external solid state device as the external journaling device. Although the best solution would be to provide a separate disk for journaling for each file block device (or even a tier of disks as a single journaling device for each file block device tier), this is highly cost prohibitive at the scale of Spider.

Unlike rotating magnetic disks, solid state disks (SSDs) have a negligible seek penalty. This makes SSDs an attractive solution for latency-sensitive storage workloads. SSDs can be flash memory based or DRAM or SRAM based. Furthermore, in recent years, solid state disks have become much more reasonable in terms of cost per GB [14]. The nearly zero seek latency of SSDs makes them a logical choice to alleviate our Lustre journaling performance bottleneck.

We evaluated Texas Memory Systems' RamSan-400 device [29] (on loan from the ViON Corp.) to assess the efficiency of an SSD-based Lustre journaling solution for the Spider parallel file system. The RamSan is a 3U rackable solution and has been optimized for high transactional aggregate performance (400,000 small I/O operations per second). The RamSan-400 is a non-volatile SSD with hard drives configured as a RAID-3 set. The front-end non-volatile solid state disks are a proprietary implementation of Texas Memory Systems' using highly redundant DDR RAM chips. The RamSan-400's block I/O performance is advertised by the vendor at an aggregate of 3 GB/sec. It is equipped with four 4x DDR InfiniBand host ports and supports the SCSI RDMA protocol (SRP).

For our testing purposes, we connected the RamSan device to our SION network via four 4x DDR IB links directly to the Core 1 switch. This configuration allowed the Lustre servers (MDS and OSSes) to have direct connections to the LUNs on the RamSan device. We configured 28 LUNs (one for each Lustre OST, 7 per IB host port) on the RamSan device. Fig. 5 presents our experiment layout.

Each LUN on the RamSan was formatted as an external ldiskfs journal device, and we established a one-to-one mapping between the external journal devices and the 28 OST block devices on one DDN S2A9900 RAID controller. The obdfilter-survey benchmark [27] was used for testing both the SAS disk-based and the RamSan-based solutions. Obdfilter-survey is part of the Lustre I/O kit, and it allows one to exercise the underlying Lustre file system with sequential I/O with varying numbers of threads and objects (files). Obdfilter-survey can be used to characterize the performance of the Lustre network, individual OSTs, and the striped file system performance (including multiple OSTs and the Lustre network components). For more details on obdfilter-survey, readers are encouraged to read the Lustre User Manual [28]. Fig. 6 presents our results for these tests.

For comparative analysis, we ran the same obdfilter-survey benchmark on three different target configurations. The first target had external journals on a tier of SAS drives in the DDN S2A9900, the second target had external journals on the RamSan-400 device, and the third target had internal journals on a tier of SATA drives on our DDN S2A9900. We varied the number of threads for each target while measuring the observed block I/O bandwidth. Both solutions with external journals provided good performance improvements. Internal journals on the SATA drives performed the worst for almost all cases.

External journals on a tier of SAS disks showed a gradual performance decrease for more than 256 I/O threads. External journals on the RamSan-400 device gave the best performance for all cases, and this solution provided sustained performance with an increasing number of I/O threads. Overall, RamSan-based external journals achieved 3,292.6 MB/sec, or 58.7% of our raw baseline performance. The performance dip for the RamSan-400 device at 16 threads was unexpected and is believed to be caused by queue starvation as a result of memory fragmentation pushing the SCSI commands beyond the scatter-gather limit. Unfortunately, we were unable to fully investigate this point prior to losing access to the test platform, and it should be noted that the 16-thread data point is outside of our normal operational envelope.

Figure 5. Layout for Lustre external journaling experiment with a RamSan-400 solid state device. The RamSan-400 was connected to the SION network via 4 DDR links and each link exported 7 LUNs.

Figure 6. SAS disk, solid state disk external Lustre journaling and SATA disk internal journaling performances.

5 The Software Solution

As explained in Section 3.2, Lustre's use of journals guarantees that when a client receives an RPC reply for a write request, the data is on stable storage and would survive a server crash. Although this implementation ensures data reliability, it serializes concurrent client write requests, as the currently running transaction cannot be closed and committed until the prior transaction fully commits to disk. With multiple RPCs in flight from the same client, the overall operation flow would appear as if several concurrent write I/O RPCs arrive at the OST at the same time. In this case the serialization in the algorithm still exists, but with more requests coming in from different sources, the OST pipeline is more efficiently utilized. The OST will start its processing and then all these requests will block waiting for their commits. Then, after each commit, replies for the respective completed operations will be sent to the requesting client, and the client will then send its next chunk of I/O requests. This algorithm works reasonably well from the aggregate bandwidth point of view as long as there are multiple writers that can keep the data flowing at all times. If there is only one client requesting service from a particular OST, the inherent serialization in this algorithm is more pronounced; waiting for each transaction to commit introduces significant delay.

An obvious solution to this problem would be to send data replies to clients immediately after the file data portion of an RPC is committed to disk. We have named this algorithm "asynchronous journal commits" and have implemented and tested it on our configuration.

7 Lustre’s existing mechanism for metadata transactions 7a Server sends a reply to client. allows it to send replies to clients about operation comple- 7b JBD then flushes the updated metadata blocks tion without waiting for data to be safe on disk. Every RPC to the journaling device and writes the commit reply from a server has a special field indicating the “id of record. the last transaction on stable storage” from that particular server’s point of view. The client uses this information to 8. If the “async” flag is NOT set in the RPC, then Server keep a list of completed, but not committed operations, so completes the operation synchronously. that in case of a server crash these operations could be resent (replayed) to the server once the server resumes operations. 8a JBD flushes transaction closed in Step 5. Our implementation extended this concept to write I/O RPCs on OSTs. In our implementation, dirty and flushed 8b Server sends a reply to the client that the operation data pages are pinned in the client memory once they are was completed successfully. submitted to the network. The client releases these pages only after it receives a confirmation from the OST indicat- The obdfilter benchmark was used for testing the asyn- ing that the data was committed to stable storage. chronous journal commit performance. Fig. 7 presents our In order to avoid using up all client memory with pinned results. The ldiskfs journal devices were created inter- data pages waiting for a confirmation for extended periods nally as part of each OST’s block device. A single DDN of time, upon receiving a reply with an uncommitted trans- S2A9900 couplet was used for this test. This approach re- action id, a special “ping” RPC is scheduled on the client 7 sulted in dramatically fewer 4 KB updates (and associated seconds into the future (ext3 flushes transactions to disk ev- head seeks) which substantially improved the aggregate per- ery 5 seconds). This “ping” RPC is pushed further in time formance to over 5,222.95 MB/s or 93% of our baseline per- if there are other RPCs scheduled by the client. This ap- formance. The dip at 16 threads is believed to be caused by proach limits the impact to the client’s memory footprint the same mechanism as explained in the previous section by bounding the time that uncommitted pages can remain and is outside of normal operational window. outstanding. While the “ping” RPC is similar in nature to NFSv3’s commit operation, Lustre optimizes this away in many cases by piggy-backing commit information on other Software-based Asynchronous Journaling Solution RPCs destined for the same client-server pair. The “asynchronous journal commits” algorithm results 5000 in a new set of steps taken by an OST processing a file up- date in the ordered journaling mode as detailed below. The 4000 following sequence of events is triggered by a Lustre client 3000 sending a write I/O request to an OST. 2000 1. Server gets the destination object id and offset for this write operation. 1000 Aggregate Block I/O (MB/s)

async-journaling 0 2. Server allocates the necessary number of pages in 0 100 200 300 400 500 600 memory and fetched the data from remote client into Number of Threads the pages via an RMA GET operation.

3. Server opens a transaction on the back-end file system. Figure 7. Asynchronous journaling perfor- 4. Server updates file metadata, allocates blocks and ex- mance tends the file size.
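The client-side bookkeeping described above can be sketched as follows. This is an illustrative model only; the names are invented for the sketch and do not correspond to the actual Lustre client code:

    PING_DELAY = 7.0   # seconds; ldiskfs/ext3 commits transactions roughly every 5 seconds

    class ClientReplayCache:
        """Toy model of the pinned-page / "ping" RPC bookkeeping described above."""

        def __init__(self):
            self.pinned = {}            # transno -> pages pinned for that write RPC
            self.ping_deadline = None   # when a "ping" RPC should go out, if ever

        def write_rpc_sent(self, transno, pages, now):
            # Dirty, flushed pages stay pinned until the OST reports them committed.
            self.pinned[transno] = pages
            self._reschedule_ping(now)

        def reply_received(self, last_committed, now):
            # Every RPC reply carries the server's last committed transaction id;
            # anything at or below it is on stable storage and can be released.
            for transno in sorted(self.pinned):
                if transno <= last_committed:
                    del self.pinned[transno]
            self._reschedule_ping(now)

        def other_rpc_sent(self, now):
            # Commit information piggy-backs on any other RPC to the same server,
            # so an outstanding "ping" can be pushed further into the future.
            self._reschedule_ping(now)

        def _reschedule_ping(self, now):
            # Keep a "ping" pending only while uncommitted pages remain; this
            # bounds how long pinned pages can occupy client memory.
            self.ping_deadline = now + PING_DELAY if self.pinned else None

A reply whose reported last-committed id covers a request releases its pages immediately; otherwise the pages stay pinned until a later reply or the scheduled "ping" confirms the commit.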

The "asynchronous journal commits" algorithm results in a new set of steps taken by an OST processing a file update in the ordered journaling mode, as detailed below (a condensed sketch follows the list). The following sequence of events is triggered by a Lustre client sending a write I/O request to an OST.

1. Server gets the destination object id and offset for this write operation.

2. Server allocates the necessary number of pages in memory and fetches the data from the remote client into the pages via an RMA GET operation.

3. Server opens a transaction on the back-end file system.

4. Server updates file metadata, allocates blocks, and extends the file size.

5. Server closes the transaction handle (not the JBD transaction) and, if the RPC does NOT have the "async" flag set, it obtains the wait handle.

6. Server writes pages with file data to disk synchronously.

7. If the "async" flag is set in the RPC, then Server completes the operation asynchronously.

   7a. Server sends a reply to the client.

   7b. JBD then flushes the updated metadata blocks to the journaling device and writes the commit record.

8. If the "async" flag is NOT set in the RPC, then Server completes the operation synchronously.

   8a. JBD flushes the transaction closed in Step 5.

   8b. Server sends a reply to the client that the operation was completed successfully.
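The list above collapses to a single branch on the "async" flag. The sketch below is an in-memory toy model of that decision (illustrative names, not actual Lustre or ldiskfs code); the only behavioral change is whether the reply waits for the journal commit.

    class ToyOST:
        """Toy model of the OST write path enumerated above."""

        def __init__(self):
            self.data_blocks = []      # stands in for file data on disk
            self.uncommitted = []      # closed but not yet committed transactions
            self.last_committed = 0
            self.next_transno = 0

        def handle_write(self, data, async_flag):
            self.next_transno += 1
            transno = self.next_transno       # Steps 3-5: open and close the handle
            self.data_blocks.append(data)     # Step 6: file data written synchronously
            self.uncommitted.append(transno)  # metadata update waiting for the journal

            if not async_flag:                # Steps 8a/8b: block until the commit
                self.commit_journal()
            # Steps 7a/8b: the reply reports what is known to be on stable storage
            return {"transno": transno, "last_committed": self.last_committed}

        def commit_journal(self):
            # Steps 7b/8a: flush the queued metadata blocks and the commit record.
            if self.uncommitted:
                self.last_committed = self.uncommitted[-1]
                self.uncommitted.clear()

With the "async" flag set, the reply returns a last-committed value that lags the request's own transaction id; the client-side replay cache sketched earlier keeps the pages pinned across exactly that window.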

The obdfilter-survey benchmark was used for testing the asynchronous journal commit performance. Fig. 7 presents our results. The ldiskfs journal devices were created internally as part of each OST's block device. A single DDN S2A9900 couplet was used for this test. This approach resulted in dramatically fewer 4 KB updates (and associated head seeks), which substantially improved the aggregate performance to over 5,222.95 MB/s, or 93% of our baseline performance. The dip at 16 threads is believed to be caused by the same mechanism explained in the previous section and is outside of the normal operational window.

Figure 7. Asynchronous journaling performance

6 Results and Discussion

A comparative analysis of the hardware-based and software-based journaling methods is presented in Fig. 8. Please note that the data presented in this figure is based on the data provided in Figures 6 and 7. As can be seen, the software-based asynchronous journaling method clearly outperforms the hardware-based solutions, providing virtually full baseline performance from the DDN S2A9900 couplet. One potential reason for the software-based solution outperforming the RamSan-based external journals may be the elimination of a network round-trip latency for each journal update operation, as the journal resides on an SRP target separate from that of the block device in this configuration. Also, the performance of external journals on solid-state disks suggests that there may be other performance issues in the external journal code path; this is supported by the lack of a performance improvement when asynchronous commits are used in combination with the RamSan-based external journal. The performance dip at 16 threads, present in both the external journal and asynchronous journal methods, requires additional analysis.

Figure 8. Aggregate Lustre performance with hardware- and software-based journaling methods.

The software-based asynchronous journaling method provides the best performance of the presented solutions, and does so at minimal cost. Therefore, we deployed this solution on Spider. We then analyzed the performance of Spider with the asynchronous journaling method on a real scientific application. For this purpose we used the Gyrokinetic Toroidal Code (GTC) application [15]. GTC is the most I/O-intensive code running at scale (at the time of writing, the largest scale runs were at 120,000 cores on Jaguar) and is a 3D gyrokinetic Particle-In-Cell (PIC) code with toroidal geometry. It was developed at the Princeton Plasma Physics Laboratory (PPPL) and was designed to study turbulent transport of particles and energy in burning plasma. GTC is part of the US Department of Energy's Scientific Discovery through Advanced Computing (SciDAC) program. GTC is coded in standard Fortran 90/95 and MPI.

We used a version of GTC that has been modified to use the Adaptable IO System (ADIOS) I/O middleware [16] rather than standard Fortran I/O directives. ADIOS is developed by Georgia Tech and ORNL to manage I/O with a simple API and a supplemental, external configuration file. ADIOS has been implemented in several scientific production codes, including GTC. Earlier GTC tests with ADIOS on Jaguar XT4 showed increased scalability and higher performance when compared to GTC runs with Fortran I/O. On the Jaguar XT4 segment, GTC with ADIOS achieved 75% of the maximum I/O performance measured with IOR [12].

Fig. 9 shows the GTC run times for 64 and 1,344 cores on Jaguar with and without asynchronous journals on the Lustre file system. Both runs were configured with the same problem, and the difference in runtime can be attributed to the compute load of each core. During these runs, the observed I/O bandwidth of the application was increased by 56.3% on average, and by 64.8% when considering only the median values.

Translating the I/O bandwidth improvements into shorter runtimes will depend heavily on the I/O profile of the application and the domain problem being investigated. In the 64-core case for GTC, the cores have a much larger compute load, and the percentage of runtime spent performing I/O drops from 6% to 2.6% when turning asynchronous journals on, with a 3.3% reduction in overall runtime. The 1,344-core test has a much lighter compute load, and the runtime is dominated by I/O time – 70% of the runtime is I/O with synchronous journals, and 36% with asynchronous journals. This is reflected in the 49.5% reduction in overall runtime.

Figure 9. GTC run times for 64 and 1,344 cores on Jaguar with and without asynchronous journals.

Fig. 10 shows the histogram of I/O requests observed by the DDN S2A9900 during our GTC runs as a percentage of total I/O requests observed. In this figure, "Async Journals" represents I/O requests observed when the asynchronous journals were turned on, and "Sync Journals" represents when asynchronous journals were turned off.

Omitted request sizes from the graph account for less than 2.3% of the total I/O requests for the asynchronous journaling method and 0.76% for the synchronous journaling method. Asynchronous journaling clearly decreased the number of small I/O requests (0 to 127 KB) from 64% to 26.5%. This reduction minimized the disk head seeks, removed the serialization, and increased the overall disk I/O performance.

Fig. 11 shows the same I/O request size histogram for 0 to 127 KB sized I/O requests as a percentage of total I/O requests observed. Again, "Async Journals" represents I/O requests observed when the asynchronous journals were turned on, and "Sync Journals" represents when asynchronous journals were turned off. It can be seen that the asynchronous journaling method reduces the number of small I/O requests (0 to 128 KB) sent to the DDN controller by delaying and aggregating the small journal commit requests into relatively larger, but still small, I/O requests, as explained in the previous section.

Figure 10. I/O request size histogram observed by the DDN S2A9900 controllers during the GTC runs.

Figure 11. I/O request size histogram for 0 to 127 KB requests observed by the DDN S2A9900 controllers during the GTC runs.

Overall, our findings were motivated by the relatively modest IOPS performance (when compared to the bandwidth performance) of our DDN S2A9900 hardware. The DDN S2A9900 architecture uses "synchronous heads," a variant of RAID 3 that provides dual-failure redundancy. For a given LUN with 10 disks, a seek on the LUN requires a seek by all devices in the LUN. This approach provides highly optimized large I/O bandwidth, but it is not very efficient for small I/O. More traditional RAID 5 and RAID 6 implementations may not see the same speedup as the DDN hardware with our approach, as the stripes containing active journal data will likely remain resident in the controller cache, minimizing the need to do "read-modify-write" cycles to commit the journal records. Still, there will be head movement for those writes, which will incur a seek penalty for the drives in the stripe chunk that holds that portion of the journal. This will have an effect on the aggregate bandwidth of the RAID array. Some preliminary testing conducted by Sun Microsystems using their own RAID hardware has shown improved performance, but the details of that testing are not currently public. We did not have the chance to test our approach on non-DDN hardware, and are unable to further qualify the impact of our solution on other RAID controllers at this time.

Our approach removed the bottleneck from the critical write path by providing an asynchronous write/commit mechanism for the Lustre file system. This solution has been previously proposed by NFSv3 and others, and we were able to implement it in an efficient manner to boost our write performance in a very large scale production storage deployment. Our approach comes with a temporary increase in memory consumption on clients, which we believe is a fair price for the performance increases. Our changes are restricted to how Lustre uses the journal, and not the operation of the journal itself. Specifically, we do not wait for the journal commit prior to allowing the client to send more data. As we have not told the client that the data is stable, it will retain it in the event that the OSS (OST) dies and the client needs to replay its I/O requests. The guarantees about file system consistency at the local OST remain unchanged. Also, our limited tests with manually injected power failures on the server side, with active write/modify I/O client RPCs in flight, resulted in consistent data on the file system, provided the clients successfully completed recovery.

7 Conclusions

Initial IOR testing with Spider's DDN S2A9900s and SATA drives on Jaguar showed that Lustre-level write performance was 24.9% of the baseline performance with a 1 MB transfer size. Profiling the I/O stream using the DDN utilities revealed a large number of 4 KB writes in addition to the expected 1 MB writes. These small writes were traced to ldiskfs journal updates. This information allowed us to identify bottlenecks in the way Lustre was using the journal – each batch of write requests blocked on the commit of a journal transaction, which added serialization to the request stream and incurred the latency of a disk head seek for each write.

We developed and implemented both a hardware-based solution and a software solution to these issues. We used external journals on solid state devices to eliminate head seeks for the journal, which allowed us to achieve 3,292.6 MB/sec, or 58.7% of our baseline performance, per DDN S2A9900. By removing the requirement for a synchronous journal commit for each batch of writes, we observed dramatically fewer 4 KB journal updates (up to 37%) and associated head seeks. This substantially improved our block I/O performance to over 5,222.95 MB/s, or 93% of our baseline performance, per DDN S2A9900 couplet. Tests with a real-world scientific application such as GTC have shown an average I/O bandwidth improvement of 56.3%. Overall, asynchronous journaling has proven to be a highly efficient solution to our performance problem, in terms of performance as well as cost-effectiveness.

Our approach removed a bottleneck from the critical write path by providing an asynchronous write/commit mechanism for the Lustre file system. This solution has been previously proposed for NFSv3 and other file systems, and we were able to implement it in an efficient manner to significantly boost our write performance in a very large scale production storage deployment.

Our current understanding and testing show that our approach does not change the guarantees of file system consistency at the local OST level, as the modifications only affect how Lustre uses the journal, and not the operation of the journal itself. However, this approach comes with a temporary increase of memory consumption on clients while waiting for the server to commit the transactions. We find this a fair exchange for the substantial performance enhancement it provides on our very large scale production parallel file system.

Our approach and findings are likely not specific to our DDN hardware, and are of interest to developers and large-scale HPC vendors and integrators in our community. Future work will include verifying broad applicability as test hardware becomes available. Other potential future work includes an analysis of how other scalable parallel file systems, such as IBM's GPFS, approach the synchronous write performance penalties.

8 Acknowledgements

The authors would like to thank our colleagues at the National Center for Computational Sciences at Oak Ridge National Laboratory for their support of our work, with special thanks to Scott Klasky for his help with the GTC code and to Youngjae Kim and Douglas Fuller for corrections and suggestions.

The research was sponsored by the Mathematical, Information, and Computational Sciences Division, Office of Advanced Scientific Computing Research, U.S. Department of Energy, under Contract No. DE-AC05-00OR22725 with UT-Battelle, LLC.

References

[1] S. R. Alam, R. F. Barrett, M. R. Fahey, J. A. Kuehn, J. M. Larkin, R. Sankaran, and P. H. Worley. Cray XT4: An early evaluation for petascale scientific simulation. In Proceedings of the ACM/IEEE Conference on High Performance Networking and Computing (SC07), Reno, NV, 2007.

[2] N. Ali, P. Carns, K. Iskra, D. Kimpe, S. Lang, R. Latham, R. Ross, L. Ward, and P. Sadayappan. Scalable I/O forwarding framework for high-performance computing systems. In Proceedings of the IEEE International Conference on Cluster Computing, Aug. 2009.

[3] M. Baker, S. Asami, E. Deprit, J. Ousterhout, and M. Seltzer. Non-volatile memory for fast, reliable file systems. In Proceedings of the 5th ASPLOS, pages 10–22, 1992.

[4] R. Brightwell, K. Pedretti, and K. D. Underwood. Initial performance evaluation of the Cray SeaStar interconnect. In HOTI '05: Proceedings of the 13th Symposium on High Performance Interconnects, pages 51–57, Washington, DC, USA, 2005. IEEE Computer Society.

[5] Cray Inc. Cray XT5. http://cray.com/Products/XT/Systems/XT5.aspx.

[6] Data Direct Networks. DDN S2A9900. http://www.ddn.com/9900.

[7] Dell. Dell PowerEdge 1950 Server. http://www.dell.com/downloads/global/products/pedge/en/1950_specs.pdf.

[8] J. Dongarra, H. Meuer, and E. Strohmaier. Top500 November 2009 list. http://www.top500.org/lists/2009/11, 2009.

[9] J. Dongarra, H. Meuer, and E. Strohmaier. Top500 supercomputing sites. http://www.top500.org, 2009.

[10] G. R. Ganger and Y. N. Patt. Metadata update performance in file systems. In OSDI '94: Proceedings of the 1st USENIX Conference on Operating Systems Design and Implementation, page 5, Berkeley, CA, USA, 1994. USENIX Association.

[11] ioperformance.com. XDD performance benchmark, version 6.5. http://www.ioperformance.com, 2008.

[12] S. Klasky. Private communication, Sept. 2009.

[13] A. Leventhal. Hybrid storage pools in the 7410. http://blogs.sun.com/ahl/entry/fishworks_launch.

[14] A. Leventhal. Flash storage memory. Communications of the ACM, 51(7):47–51, 2008.

[15] Z. Lin, T. S. Hahm, W. W. Lee, W. M. Tang, and R. B. White. Turbulent transport reduction by zonal flows: Massively parallel simulations. Science, 18:1835–1837, 1988.

[16] J. Lofstead, F. Zheng, S. Klasky, and K. Schwan. Adaptable, metadata rich IO methods for portable high performance IO. In Proceedings of IPDPS'09, May 25–29, Rome, Italy, 2009.

[17] M. Rosenblum and J. K. Ousterhout. The design and implementation of a log-structured file system. ACM Transactions on Computer Systems, 10(1):26–52, 1992.

[18] D. A. Nowak and M. Seagar. ASCI terascale simulation: Requirements and deployments. http://www.ornl.gov/sci/optical/docs/Tutorial19991108Nowak.pdf.

[19] Oak Ridge National Laboratory, National Center for Computational Sciences. Jaguar. http://www.nccs.gov/jaguar/.

[20] B. Pawlowski, C. Juszczak, P. Staubach, C. Smith, D. Lebel, and D. Hitz. NFS Version 3 – design and implementation. In Proceedings of the Summer 1994 USENIX Technical Conference, pages 137–152, 1994.

[21] V. Prabhakaran, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau. Analysis and evolution of journaling file systems. In Proceedings of the Annual USENIX Technical Conference, May 2005.

[22] B. Schroeder and G. A. Gibson. Understanding failures in petascale computers. Journal of Physics: Conference Series, 78(1):012022, July 2007.

[23] M. Seltzer, G. Ganger, K. McKusick, K. Smith, C. Soules, and C. Stein. Journaling versus soft updates: Asynchronous meta-data protection in file systems. In Proceedings of the USENIX Technical Conference, pages 71–84, June 2000.

[24] H. Shan and J. Shalf. Using IOR to analyze the I/O performance of XT3. In Proceedings of the 49th Cray User Group (CUG) Conference, Seattle, WA, 2007.

[25] G. Shipman. Spider and SION: Supporting the I/O demands of a peta-scale environment. In Cray User Group Meeting, 2008.

[26] G. Shipman, D. Dillow, S. Oral, and F. Wang. The Spider center wide file system: From concept to reality. In Proceedings of the Cray User Group (CUG) Conference, Atlanta, GA, May 2009.

[27] Sun Microsystems. Lustre I/O kit, obdfilter-survey. http://manual.lustre.org/manual/LustreManual16_HTML/LustreIOKit.html.

[28] Sun Microsystems Inc. Lustre wiki. http://wiki.lustre.org, 2009.

[29] Texas Memory Systems Inc. RamSan-400. http://www.ramsan.com/products/ramsan-400.htm.

[30] S. C. Tweedie. Journaling the Linux ext2fs filesystem. In Proceedings of the Fourth Annual Linux Expo, 1998.

[31] F. Wang, S. Oral, G. Shipman, O. Drokin, T. Wang, and I. Huang. Understanding Lustre filesystem internals. Technical Report ORNL/TM-2009/117, Oak Ridge National Laboratory, National Center for Computational Sciences, 2009.

[32] W. Yu, S. Oral, S. Canon, J. Vetter, and R. Sankaran. Empirical analysis of a large-scale hierarchical storage system. In 14th European Conference on Parallel and Distributed Computing (Euro-Par 2008), 2008.

[33] W. Yu, J. Vetter, and S. Oral. Performance characterization and optimization of parallel I/O on the Cray XT. In Proceedings of the 22nd IEEE International Parallel and Distributed Processing Symposium (IPDPS'08), Miami, FL, 2008.
