Lessons Learned in Deploying the World's Largest Scale Lustre File System

Galen M. Shipman, David A. Dillow, Sarp Oral, Feiyi Wang, Douglas Fuller, Jason Hill, Zhe Zhang
Oak Ridge Leadership Computing Facility, Oak Ridge National Laboratory
Oak Ridge, TN 37831, USA
{gshipman, dillowda, oralhs, fwang2, fullerdj, hilljj, ...}@ornl.gov

Abstract

The Spider system at the Oak Ridge National Laboratory's Leadership Computing Facility (OLCF) is the world's largest scale Lustre parallel file system. Envisioned as a shared parallel file system capable of delivering both the bandwidth and capacity requirements of the OLCF's diverse computational environment, the project had a number of ambitious goals. To support the workloads of the OLCF's diverse computational platforms, the aggregate performance and storage capacity of Spider exceed those of our previously deployed systems by factors of 6x (240 GB/sec) and 17x (10 Petabytes), respectively. Furthermore, Spider supports over 26,000 clients concurrently accessing the file system, which exceeds our previously deployed systems by nearly 4x. In addition to these scalability challenges, moving to a center-wide shared file system required dramatically improved resiliency and fault-tolerance mechanisms. This paper details our efforts in designing, deploying, and operating Spider. Through a phased approach of research and development, prototyping, deployment, and transition to operations, this work has resulted in a number of insights into large-scale parallel file system architectures, from both the design and the operational perspectives. We present in this paper our solutions to issues such as network congestion, performance baselining and evaluation, file system journaling overheads, and high availability in a system with tens of thousands of components. We also discuss areas of continued challenges, such as stressed metadata performance and the need for file system quality of service, alongside our efforts to address them. Finally, operational aspects of managing a system of this scale are discussed along with real-world data and observations.

1 Introduction

The Oak Ridge Leadership Computing Facility (OLCF) at Oak Ridge National Laboratory (ORNL) hosts the world's most powerful supercomputer, Jaguar [2, 14, 7], a 2.332 Petaflop/s Cray XT5 [5]. OLCF also hosts an array of other computational resources such as a 263 Teraflop/s Cray XT4 [1], visualization, and application development platforms. Each of these systems requires a reliable, high-performance and scalable file system for data storage.

Parallel file systems on leadership-class systems have traditionally been tightly coupled to single simulation platforms. This approach resulted in the deployment of a dedicated file system for each computational platform at the OLCF, creating islands of data. These dedicated file systems impacted user productivity due to costly data transfers and added substantially to the total system deployment cost.

Recognizing the above problems, OLCF initiated the Spider project to deploy a center-wide parallel file system capable of serving the needs of its computational resources. The project had a number of ambitious goals:

1. To provide a single shared storage pool for all computational resources.

2. To meet the aggregate performance and scalability requirements of all OLCF platforms.

3. To provide resilience against system failures internal to the storage system as well as failures of any computational resources.
4. To allow growth of the storage pool independent of the computational platforms.

The OLCF began a feasibility evaluation in 2006, initially focusing on developing prototype systems and integrating these systems within the OLCF. While the OLCF was the primary driver of this initiative, partnerships with Cray, Sun Microsystems, and DataDirect Networks (DDN) were developed to achieve the project's ambitious technical goals. A Lustre Center of Excellence (LCE) at ORNL was established as a result of our partnership with Sun.

This paper presents our efforts in pushing Spider toward its pre-designated goals; the remainder of the paper is organized as follows: Section 2 provides an overview of the Spider architecture. Key challenges and their resolutions are described in Section 3. Section 4 describes our approaches and challenges in day-to-day operations since Spider transitioned to full production in June 2009. Finally, conclusions are discussed in Section 5.

2 Architecture

2.1 Spider

Spider is a Lustre-based [21, 25] center-wide file system replacing multiple file systems within the OLCF. It provides centralized access to petascale data sets from all OLCF platforms, eliminating islands of data.

Unlike many existing storage systems based on local high-performance RAID sets for each computation platform, Spider is a large-scale shared storage cluster. 48 DDN S2A9900 [6] controller couplets provide storage which in aggregate delivers over 240 GB/s of bandwidth and over 10 Petabytes of RAID 6 formatted capacity from 13,440 1-Terabyte SATA drives.

The DDN S2A9900 couplet is composed of two singlets (independent RAID controllers). Coherency is loosely maintained over a dedicated Serial Attached SCSI (SAS) link between the controllers. This is sufficient for ensuring consistency for a system with write-back cache disabled. An XScale processor [9] manages the system but is not in the direct data path. In our configuration, host-side interfaces in each singlet are populated with two dual-port 4x Double Data Rate (DDR) InfiniBand (IB) Host Channel Adapters (HCAs). The back-end disks are connected via ten SAS links to each singlet. For Spider's SATA-based system, these SAS links connect to expander modules within each disk shelf. The expanders then connect to SAS-to-SATA adapters on each drive. All components have redundant paths. Each singlet and disk tray has dual power supplies, one of which is protected by an uninterruptible power supply (UPS). Figure 1 illustrates the internal architecture of a DDN S2A9900 couplet and Figure 2 shows the overall Spider architecture.

Figure 1: Internal architecture of a S2A9900 couplet.

Figure 2: Overall Spider architecture.

The DDN storage is accessed through 192 Dell dual-socket quad-core Lustre OSSs (object storage servers) providing over 14 Teraflop/s in performance and 3 Terabytes of memory in aggregate. Each OSS can provide in excess of 1.25 GB/s of file system level performance. Metadata is stored on 2 LSI Engenio 7900s (XBB2) [11] and is served by 3 Dell quad-socket quad-core systems.

Each DDN S2A9900 couplet is configured with 28 RAID 6 8+2 tiers. Four OSSs provide access to these 28 tiers (7 each).
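These per-component figures imply the aggregate numbers quoted above. The short Python sketch below is a back-of-the-envelope check only, assuming the counts quoted in this section and decimal Terabytes; it is not part of the Spider tooling.

```python
# Back-of-the-envelope check of Spider's aggregate figures, using only the
# per-component counts quoted in Section 2.1. Illustrative only; decimal
# Terabytes are assumed (1 PB = 1000 TB).

COUPLETS = 48               # DDN S2A9900 controller couplets
TIERS_PER_COUPLET = 28      # RAID 6 (8+2) tiers per couplet
DATA_DRIVES_PER_TIER = 8    # data drives in an 8+2 tier
PARITY_DRIVES_PER_TIER = 2  # parity drives in an 8+2 tier
DRIVE_TB = 1                # 1 Terabyte SATA drives
OSS_COUNT = 192             # Dell Lustre object storage servers
OSS_GB_PER_SEC = 1.25       # per-OSS file system level bandwidth
OSS_PER_COUPLET = 4         # four OSSs front each couplet

drives = COUPLETS * TIERS_PER_COUPLET * (DATA_DRIVES_PER_TIER + PARITY_DRIVES_PER_TIER)
formatted_tb = COUPLETS * TIERS_PER_COUPLET * DATA_DRIVES_PER_TIER * DRIVE_TB
bandwidth_gb_per_sec = OSS_COUNT * OSS_GB_PER_SEC
tiers_per_oss = TIERS_PER_COUPLET // OSS_PER_COUPLET

print(f"drives              : {drives}")                         # 13440
print(f"formatted capacity  : {formatted_tb / 1000:.2f} PB")     # ~10.75 PB
print(f"aggregate bandwidth : {bandwidth_gb_per_sec:.0f} GB/s")  # 240 GB/s
print(f"tiers per OSS       : {tiers_per_oss}")                  # 7
```

The 8+2 parity overhead accounts for the gap between the roughly 13.4 PB of raw capacity and the quoted formatted capacity of over 10 PB.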
OLCF compute platforms are configured with Lustre routers to access Spider's OSSs. All DDN controllers, OSSs, and Lustre routers are configured in partnered failover pairs to ensure file system availability in the event of any component failure. Connectivity between Spider and OLCF platforms is provided by our large-scale 4x DDR IB network, dubbed the Scalable I/O Network (SION).

Moving toward a centralized file system requires increased redundancy and fault tolerance. Spider is designed to eliminate single points of failure and thereby maximize availability. By using failover pairs, multiple networking paths, and the resiliency features of the Lustre file system, Spider provides a reliable, high-performance centralized storage solution, greatly enhancing our capability to deliver scientific insight.

On Jaguar XT5, 192 Cray Service I/O (SIO) nodes are configured as Lustre routers. Each SIO node has 8 AMD Opteron cores and 8 GBytes of RAM and is connected to Cray's SeaStar2+ [3, 4] network. Each SIO is connected to SION using Mellanox ConnectX [22] host channel adapters (HCAs). These Lustre routers allow compute nodes within the SeaStar2+ torus to access the Spider file system at speeds in excess of 1.25 GB/s per compute node. The Jaguar XT4 partition is similarly configured with 48 Cray SIO nodes acting as Lustre routers. In aggregate, the XT5 partition has over 240 GB/s of storage throughput while the XT4 partition has over 60 GB/s. Other OLCF platforms are similarly configured.

Figure 3: Scalable I/O Network (SION) architecture.

2.2 Scalable I/O Network (SION)

In order to provide true integration among all systems hosted by OLCF, a high-performance, large-scale IB [8] network, dubbed SION, has been deployed. SION provides advanced capabilities including resource sharing and communication between the two segments of Jaguar and Spider, and real-time visualization, streaming data [...] The fourth provides connectivity to all other OLCF platforms via the aggregation switch. Spider is connected to the core switches via 48 24-port IB switches, allowing storage to be accessed directly from SION. Additional switches provide connectivity for the remaining OLCF platforms. SION is routed using OpenSM [24] with its min-hop routing algorithm, with an added list of IB GUIDs that have their forwarding entries placed in order before the general population.
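The effect of this GUID list is a simple ordering policy: forwarding entries for the listed GUIDs are assigned first, in the listed order, ahead of the remaining ports. The Python sketch below illustrates only that ordering policy, with hypothetical GUID values; it is not OpenSM code, and the choice of which ports to list (for example, storage-facing ports) is an assumption rather than a detail taken from this section.

```python
# Illustrative sketch of the routing-order policy described above: GUIDs on a
# supplied priority list have their forwarding entries placed first, in the
# listed order, ahead of the general population of ports. Not OpenSM code;
# the GUID values below are hypothetical.

def forwarding_order(all_port_guids, priority_guids):
    """Return port GUIDs in the order their forwarding entries are assigned."""
    known = set(all_port_guids)
    prioritized = set(priority_guids)
    head = [g for g in priority_guids if g in known]             # listed GUIDs first
    tail = [g for g in all_port_guids if g not in prioritized]   # then everyone else
    return head + tail

if __name__ == "__main__":
    fabric = ["0x0002c9030000a001", "0x0002c9030000a002",
              "0x0002c9030000b001", "0x0002c9030000b002"]
    # e.g., list storage-facing ports so routes toward them are balanced first
    priority = ["0x0002c9030000b002", "0x0002c9030000b001"]
    print(forwarding_order(fabric, priority))
```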