The Spider Center Wide File System: From Concept to Reality

Galen M. Shipman, David A. Dillow, Sarp Oral, Feiyi Wang

National Center for Computational Sciences, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA
{gshipman,dillowda,oralhs,fwang2}@ornl.gov

Abstract

The Leadership Computing Facility (LCF) at Oak Ridge National Laboratory (ORNL) has a diverse portfolio of computational resources ranging from a petascale XT4/XT5 simulation system (Jaguar) to numerous other systems supporting development, visualization, and data analytics. To support the vastly different I/O needs of these systems, Spider, a Lustre-based center-wide file system, was designed and deployed to provide over 240 GB/s of aggregate throughput with over 10 petabytes of formatted capacity. A multi-stage InfiniBand network, dubbed the Scalable I/O Network (SION), with over 889 GB/s of bisection bandwidth was deployed as part of Spider to provide connectivity to our simulation, development, visualization, and other platforms. To our knowledge, at the time of this writing Spider is the largest and fastest POSIX-compliant parallel file system in production. This paper details the overall architecture of the Spider system, the challenges encountered in deploying and initially testing a file system of this scale, and novel solutions to these challenges which offer key insights into future file system design.

1 Introduction

In 2008 the Leadership Computing Facility (LCF) at Oak Ridge National Laboratory (ORNL) deployed a 1.38 Petaflop Cray XT5 supercomputer [2], dubbed Jaguar, providing the world's most powerful HPC platform for open-science applications. In addition to Jaguar XT5, the LCF also hosts an array of other computational resources such as Jaguar XT4, visualization systems, and application development systems. Each of these systems requires a high performance scalable file system for stable storage.

Meeting the aggregate file system performance requirements of these systems is a daunting challenge. Using system memory as the primary driver of file system bandwidth resulted in a requirement of 240 GB/sec of throughput for Jaguar XT5. Achieving this level of performance requires a parallel file system that can utilize thousands of magnetic disks concurrently; the Lustre [4] parallel file system provides this capability on the Jaguar XT5 platform. Aggregate bandwidth of 200 GB/sec has been demonstrated using the Lustre file system on Jaguar XT5, and work is ongoing to reach our target (baseline) bandwidth of 240 GB/sec.

Parallel file systems on leadership class machines have traditionally been tightly coupled to a single simulation platform. This approach has resulted in the deployment of a dedicated file system for each computational platform at the LCF. These dedicated file systems have created islands of data within the LCF. Users working on a visualization system such as Lens cannot directly access data generated from a large scale simulation on Jaguar and must instead resort to transferring data over the LAN. Maintaining even two separate namespaces is a distraction for users. In addition to poor usability, dedicated file systems can make up a substantial percentage of total system deployment cost, often exceeding 10%. The LCF recognized that deploying dedicated file systems for each computational platform was neither cost effective nor manageable from an operational perspective.
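The 240 GB/sec requirement above comes from sizing file system bandwidth against aggregate system memory. The short sketch below illustrates the style of that calculation; the 300 TB memory figure and the dump window are illustrative assumptions, not numbers taken from this paper.

```python
# Illustrative sizing sketch (assumed numbers, not from the paper):
# if file system bandwidth is driven by the time needed to write a large
# fraction of system memory (e.g., a checkpoint), then
#   required_bandwidth ~= memory_to_dump / acceptable_dump_time.

TOTAL_MEMORY_TB = 300.0   # assumed aggregate Jaguar XT5 memory
DUMP_FRACTION = 1.0       # fraction of memory written per checkpoint
DUMP_WINDOW_MIN = 21.0    # assumed acceptable checkpoint window (minutes)

bytes_to_dump = TOTAL_MEMORY_TB * 1e12 * DUMP_FRACTION
required_gb_per_sec = bytes_to_dump / (DUMP_WINDOW_MIN * 60) / 1e9

print(f"required bandwidth ~ {required_gb_per_sec:.0f} GB/s")  # ~238 GB/s
```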


With this in mind the LCF initiated the Spider project to deploy a center wide parallel file system capable of serving the needs of the largest scale computational resources at the LCF. The Spider project had a number of ambitious goals:

1. To provide a single storage pool for all LCF computational resources.

2. To meet the performance and scalability requirements of all LCF platforms, including the largest.

3. To provide resilience in the face of system failures, both internal to the storage system as well as failures of external systems such as Jaguar XT5.

4. To allow upgrades with minimum reconfiguration effort, allowing sustained growth of the storage pool independent of the computational platforms.

The LCF began evaluating the feasibility of meeting these goals in 2006. Initial work focused on developing prototype systems and integrating these systems within the LCF. While the LCF was the primary driver of this initiative, partnerships with Cray, Sun Microsystems, and Data Direct Networks (DDN) were developed in order to achieve the technically ambitious goals dictated by the Spider project. The Lustre Center of Excellence (LCE) at ORNL was established as a result of our partnership with Sun. A primary activity of the LCE is to improve Lustre scalability, performance, and reliability for our leadership class computational platforms. To this end the LCE has 3 on-site Lustre file system engineers and 1 on-site Lustre application I/O specialist. Through the LCE and our partnerships with Cray, Sun, and DDN, the LCF has worked through a number of complex issues in the development of the Spider file system, which will be elaborated in Section 3.

This paper describes the Spider parallel file system and our efforts in taking this system from concept to reality. The remainder of this paper is organized as follows: Section 2 provides an overview of the Spider file system architecture. A number of key challenges and their resolutions are described in Section 3. Finally, conclusions are discussed in Section 4, followed by future work in Section 5.

2 Architecture

2.1 Spider

Spider, a Lustre-based center-wide file system, will replace multiple file systems on the LCF network with a single scalable system. Spider provides centralized access to petascale data sets from all LCF platforms, eliminating islands of data. File transfers among LCF resources will be unnecessary. Transferring petascale data sets between Jaguar and the visualization system, for example, could take hours between two different file systems, tying up bandwidth on Jaguar and slowing simulations in progress. Eliminating such unnecessary file transfers will improve performance, usability, and cost. Data analytics platforms will benefit from the high bandwidth of Spider without requiring a large investment in dedicated storage.

Unlike previous storage systems, which are simply high-performance RAID sets connected directly to the computation platform, Spider is a large-scale storage cluster. 48 DDN S2A9900 couplets provide the object storage, which in aggregate delivers over 240 GB/s of bandwidth and over 10 petabytes of RAID6 formatted capacity from 13,440 1-terabyte SATA drives.

The DDN S2A9900 is an update to the S2A9550 product. The couplet is composed of two singlets. Coherency is loosely maintained over a dedicated Serial Attached SCSI (SAS) link between the controllers. This is sufficient for ensuring consistency for a system with write-back cache disabled. An XScale processor manages the system but is not in the direct data path. Host-side interfaces in each singlet can be populated with two dual-port 4x DDR IB HCAs or two 4Gb FC HBAs. The back-end disks are connected via ten SAS links on each singlet. For a SATA based system, these SAS links connect to expander modules within each disk shelf. The expanders then connect to SAS-to-SATA adapters on each drive. All components have redundant paths. Each singlet and disk tray has dual power supplies, where one power supply is powered by house power and the other by the UPS. Figure 1 illustrates the internal architecture of a DDN S2A9900 couplet and Figure 2 shows the overall Spider architecture. The DDN S2A9900 utilizes FPGAs to handle the pipeline of data from hosts to back-end disks. Different FPGAs handle segmenting the data, performing the Reed-Solomon (RAID6) encoding and decoding, and queuing the required disk commands.
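The formatted capacity quoted above follows directly from the drive count and the RAID6 8+2 tier layout described in this section. A minimal sanity-check of that arithmetic, assuming base-10 terabytes and ignoring file system metadata overhead:

```python
# Capacity sketch for Spider's RAID6 (8+2) backend, using figures from the text.
# Assumes 1 TB = 1e12 bytes; real formatted capacity is further reduced by
# file system overhead, so this is only a rough check.

DRIVES = 13440          # 1 TB SATA drives
TIER_WIDTH = 10         # 8 data + 2 parity disks per tier
DATA_PER_TIER = 8

tiers = DRIVES // TIER_WIDTH                 # 1344 RAID6 tiers
raw_pb = DRIVES / 1000.0                     # ~13.4 PB raw
formatted_pb = tiers * DATA_PER_TIER / 1000  # ~10.8 PB of data capacity

print(tiers, raw_pb, formatted_pb)           # 1344 tiers, 13.44 PB, 10.752 PB
```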

Figure 1: Internal architecture of a DDN S2A9900 couplet

Figure 2: Overall Spider architecture (SION IB network, 192 4x DDR IB connections to 192 Spider OSS servers, 192 4x DDR IB connections to 96 DDN 9900 controllers, 7 8+2 tiers per OSS)

As a result of this architecture, the system is fairly rigid. The system requires all 10 back-end channels to be connected to disk trays even for the smallest configuration, and it only supports an 8+2 configuration, in either RAID 5 or RAID 6. Our system is configured with RAID 6. However, the system is highly efficient, and since the additional checks are performed in hardware on-the-fly, they introduce very little overhead. For example, enabling parity checking on read operations has been shown to have negligible impact on read performance. In fact, the check is enabled by default.

This object storage is accessed through 192 Dell dual-socket quad-core Lustre OSS (object storage server) nodes providing over 14 teraflops in performance and 3 terabytes of memory in aggregate. Each OSS can provide in excess of 1.25 GB/s of file system level performance. Metadata is stored on 2 LSI Engenio 7900s (XBB2) and is served by 3 Dell quad-socket quad-core systems. These systems are interconnected via SION, providing a high performance backplane for Spider.

Each DDN S2A9900 is configured with 28 RAID 6 8+2 tiers. Four OSSs provide access to these 28 tiers (7 each). OSSs are configured in failover pairs so that in the event of an OSS failure the designated partner can provide access to the failed OSS's OSTs. Each OSS is connected to both DDN controllers in a couplet, using DM multipath, so that it can access its OSTs in the event of a controller failure. Under such a controller failure, while bandwidth is reduced, the storage system remains accessible to users. Each LCF platform is configured with Lustre routers to access Spider as if the storage were locally attached. All other Lustre components reside within the Spider infrastructure, providing ease of maintenance as detailed above. Multiple routers are configured for each platform to provide performance and redundancy in the event of a failure.

On the Jaguar XT5 partition, 192 Cray service I/O (SIO) nodes, each with a dual-socket AMD Opteron and 8 GBytes of RAM, are connected to Cray's SeaStar2+ network via HyperTransport. Each SIO node is also connected to SION using Mellanox ConnectX HCAs and Zarlink CX4 optical cables. These SIO nodes are configured as Lustre routers to allow compute nodes within the SeaStar2+ torus to access the Spider filesystem at speeds in excess of 1.25 GB/s per XT5 compute node. The Jaguar XT4 partition is similarly configured with 48 Cray SIO nodes acting as Lustre routers. In aggregate the XT5 partition has over 240 GB/s of storage throughput while XT4 has over 60 GB/s. Other LCF platforms are similarly configured with Lustre routers in order to provide the requisite performance of a balanced platform.
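The figures above can be cross-checked with a short sketch; everything here comes from the configuration described in this section (48 couplets, 28 tiers per couplet, 4 OSSs per couplet, 1.25 GB/s per OSS):

```python
# Sanity check of the Spider object-storage configuration described above.

COUPLETS = 48
TIERS_PER_COUPLET = 28       # RAID6 8+2 tiers; each tier backs one OST
OSS_PER_COUPLET = 4
OSS_BANDWIDTH_GBS = 1.25     # file-system-level bandwidth per OSS

oss_nodes = COUPLETS * OSS_PER_COUPLET               # 192 OSS nodes
osts = COUPLETS * TIERS_PER_COUPLET                  # 1344 OSTs
osts_per_oss = TIERS_PER_COUPLET // OSS_PER_COUPLET  # 7 OSTs per OSS
aggregate_gbs = oss_nodes * OSS_BANDWIDTH_GBS        # 240 GB/s

print(oss_nodes, osts, osts_per_oss, aggregate_gbs)
```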

Moving towards a centralized file system requires increased redundancy and fault tolerance. Spider is designed to eliminate single points of failure and thereby maximize availability. By using failover pairs, multiple networking paths, and the resiliency features of the Lustre file system, Spider provides a reliable high-performance centralized storage solution, greatly enhancing our capability to deliver scientific insight.

2.2 SION

In order to provide true integration between all systems hosted by the LCF, a high-performance large-scale IB network, dubbed the Scalable I/O Network (SION), has been deployed. SION is a multi-stage InfiniBand network that enhances the current capabilities of the LCF. Such capabilities include resource sharing and communication between the two segments of Jaguar, as well as real time visualization, as data from the simulation platform can stream to the visualization platform at extremely high data rates. Figure 3 illustrates the SION architecture.

Figure 3: Scalable I/O Network (SION) architecture (core 288-port Cisco 7024D switches interconnecting the Jaguar XT4 and XT5 partitions, the Spider phase 1 and phase 2 I/O servers, and the Smoky, Lens/Everest, and Ewok platforms over DDR InfiniBand)

SION currently connects both segments (XT4 and XT5) of the 1.645 PFLOPS Jaguar system and our centralized Lustre storage cluster (Spider), Lens (visualization cluster), Ewok (end-to-end cluster), Smoky (application development and readiness cluster), HPSS [11], and GridFTP [1] servers.

As new platforms are deployed at the LCF, SION will continue to scale out, providing an integrated backplane of services. Rather than replicating infrastructure services for each new deployment, SION will allow access to existing services, thereby reducing total costs, enhancing usability, and decreasing the time from initial acquisition to production readiness.

SION is a high performance DDR IB network providing over 889 GB/s of bisection bandwidth. The core network infrastructure is based on three 288-port Cisco 7024D IB switches and an additional fourth Cisco 7024D. One switch provides an aggregation link while the other two switches provide connectivity between the two Jaguar segments and the Spider file system. The fourth 7024D switch provides connectivity to all other LCF platforms and is connected to the single aggregation switch. Spider is connected to the core switches via 48 24-port Flextronics IB switches, allowing storage to be accessed directly from SION. Additional switches provide connectivity for the remaining LCF platforms.

The LCF spans over 40,000 square feet of raised floor space with platforms spread throughout the two-story center. In order to handle the distance requirements imposed by such a large-scale center, Zarlink IB optical cables in a number of lengths of up to 60 meters are used. These long cables allowed connectivity throughout our two-story facility, an impossibility with copper cables. In total SION has over 3,000 InfiniBand (IB) [5] ports and over 3 miles of optical cables providing high performance connectivity. Future plans for SION include an upgrade for a 576 GB/s bandwidth increase.

The number and complexity of the LCF systems deployed required an in-house solution to the IB routing configuration. Out of the box, OpenSM was able to provide a better routing configuration for SION than the Cisco subnet manager, such as minimizing the number of hops per connection. However, OpenSM assigned two of the primary destinations on the Spider leaf switches to share a single uplink from the core Cisco 7024D switches, and this resulted in a 33% overall performance reduction. To alleviate this shortcoming, OpenSM is provided with an ordered list of GUIDs that should have their forwarding entries placed before the general population. In this way, the standard algorithms that OpenSM uses to determine the minimum hop count and the sharing of links remain unchanged, yet the combination of techniques gives each primary destination a dedicated link with respect to other primary destinations. This solution allowed the LCF to better utilize the deployed SION bandwidth.
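A sketch of how such a routing preference can be expressed is shown below. The GUID values and file path are hypothetical, and the exact OpenSM option should be checked against the deployed OpenSM version; the paper states only that an ordered GUID list was supplied, not the precise mechanism.

```python
# Hypothetical sketch: emit an ordered list of port GUIDs (the primary
# destinations, e.g. Spider OSS HCA ports) so that the subnet manager assigns
# their forwarding entries before the general population. OpenSM can consume
# such a list via a routing-order file option; verify the option name and GUID
# format for your OpenSM release before relying on this.

primary_guids = [
    "0x0002c903000010f1",   # hypothetical OSS HCA port GUIDs
    "0x0002c903000010f5",
    "0x0002c903000010f9",
]

with open("spider_guid_order.conf", "w") as f:   # hypothetical path
    for guid in primary_guids:
        f.write(guid + "\n")

print("wrote", len(primary_guids), "primary-destination GUIDs")
```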

Figure 4: 10-year Projection of Failure Rate (projected failure rate for a couplet, plotted against the number of years)

3 Integration

To provide a scalable, high-performance, and reliable parallel file system at the LCF, the Technology Integration Group began evaluating a number of technologies in 2007. IB had gained considerable traction in the HPC space and had been demonstrated at scales exceeding our requirements [9]. The availability of double data rate (DDR) InfiniBand, high port count switches, and long reach (up to 100 meter) optical cabling provided a plausible solution to our system area network requirements. Much of our early work on IB evaluation focused on optical cable testing [6] and porting the OpenFabrics OFED stack to the Cray service I/O (SIO) node. Cray later provided a productized version of this work and now fully supports IB on the XT series.

Working closely with DDN, ORNL began evaluation of the DDN S2A9900 storage system in 2008. The LCF fielded one of the earliest examples of the S2A9900 platform for evaluation and worked with DDN to address a number of performance, stability, and reliability issues. During this period a number of firmware and software level changes substantially improved the S2A9900 storage platform in order to meet the requirements of the Spider file system.

3.1 Reliability Analysis of the DDN S2A9900

Necessitated by both the scale and the uniqueness of the Spider system, our integration work also included analysis from a system reliability perspective. Our goal was to establish a failure model and a quantitative expectation of the system's reliability and availability. In addition to examining the AFR (Annual Failure Rate), the impact of UBE (Uncorrectable Bit Errors), and the impact of RAID 8+2 on overall system reliability [3, 7], we also developed a detailed failure model for the DDN S2A9900. Particular attention was given to the DDN S2A9900's peripheral components as they are often the deciding factors in overall system reliability.

There are three other major components besides the disk arrays which we considered: the I/O module, the DEM, and the baseboard. The vendor-supplied MTTF values for these three components are shown in the following table. Our approach is to first define the possible Spider failures, then take into account the physical layout of our storage system and deduce the reliability graph for the couplet system: the composite system is composed of a mix of series and parallel component connections based on the failure model we defined, and from this we can reach an estimate of the reliability of the composite system (one couplet).

Component     MTTF
I/O Module    1,263,856
DEM           1,552,437
Baseboard     356,143

Based on this failure model (with an enumeration of six defined failure cases) as well as the known component MTTFs listed in the table above, we can project the failure rate of one couplet over a 10-year period, as illustrated in Figure 4. We can also quantify the 10-year spread of each failure case in the form of the box-graph in Figure 5. It shows that on average (indicated by the middle bar in the box), case 1 and case 3 have the most significant impact on the overall failure rate.

The basic insights gained by this analysis are:

• The reliability of the peripheral components presents the most severe impact on overall reliability, with the baseboard being the weakest link. It contributes to approximately 50% of the possible failure scenarios.

• For the first year, we can expect the number of failed couplets to be 0.64. However, the expected number grows to approximately 2.42 within a few years of operation, indicating a high probability that we will experience couplet failures.
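To make the projection concrete, the sketch below shows one way such an estimate can be computed from the vendor MTTFs, assuming independent, exponentially distributed component lifetimes, MTTF values expressed in hours, and two of each component per couplet. The series-only structure used here is a simplification of the paper's six-case failure model, so the resulting numbers are only an illustrative (pessimistic) bound, not the paper's results.

```python
import math

# Simplified couplet reliability sketch: treat one couplet as a series system
# of its baseboards, I/O modules, and DEMs, each with an exponential lifetime.
# This ignores the redundant (parallel) paths in the real failure model.

MTTF_HOURS = {               # vendor-supplied MTTFs from the table above
    "io_module": 1_263_856,  # assumed to be hours
    "dem": 1_552_437,
    "baseboard": 356_143,
}
COUNT_PER_COUPLET = {"io_module": 2, "dem": 2, "baseboard": 2}  # assumed counts

HOURS_PER_YEAR = 24 * 365

def p_fail_by(years: float) -> float:
    """Probability that at least one component in the couplet has failed."""
    rate = sum(COUNT_PER_COUPLET[c] / MTTF_HOURS[c] for c in MTTF_HOURS)
    return 1.0 - math.exp(-rate * years * HOURS_PER_YEAR)

for year in (1, 2, 5, 10):
    expected_failed_couplets = 48 * p_fail_by(year)
    print(f"year {year:2d}: P(couplet failure) = {p_fail_by(year):.3f}, "
          f"expected failed couplets ~ {expected_failed_couplets:.1f}")
```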

Figure 5: Failure Rate Distribution for Each Case (box-graph of the failure rate contribution of the six failure cases; case 1 is any two baseboard failures, case 3 is the failure of one baseboard and one I/O module)

3.2 Establishing a Baseline of Performance of the DDN S2A9900

In order to obtain a baseline of performance on the DDN S2A9900, the XDD benchmark [8] was used. XDD provides a mechanism to direct multiple processes to read and/or write blocks from/to a block device in a synchronized fashion. Readers and writers can exist on different machines, with XDD synchronizing all processes prior to initiating their transfers. XDD can be run in sequential read or write mode, or in random read or write mode. Our initial experiments focused on aggregate performance for sequential or random write or read workloads. Performance results using XDD from 4 hosts connected to the DDN via DDR IB are summarized in Figure 6, which presents sequential read, sequential write, random read, and random write performance. These tests were run using 1MB and 4MB transfers, both from a single host ("single") and from 4 hosts each driving 7 LUNs ("multi" in the Tiers column). The results of 5 runs in each configuration are presented. Of particular interest is the dramatically improved performance of random read and random write operations when transfer sizes are increased from 1MB to 4MB, as the cost of the head seek is amortized over a larger write. Efforts are ongoing to improve Lustre performance by utilizing 4MB transfers.

Figure 6: XDD Performance Results (aggregate MB/s for sequential and random read and write patterns, 1MB versus 4MB request sizes, single-host and multi-host configurations, 5 runs each)

3.3 Improving File System Journaling

After establishing a baseline of performance using XDD, we examined Lustre level performance using the IOR benchmark [10]. Testing was conducted using 4 OSSes, each connected to a DDN S2A9900. Our initial results showed very poor write performance with 28 clients, each writing to a different OST. Write performance of only 1398.99 MB/sec was a mere 25.8% of our baseline performance metric of XDD sequential writes with a 1MB transfer size. Profiling the I/O stream using the DDN 9900 utilities revealed a large number of 4KB writes in addition to the expected 1MB writes. These small writes were eventually traced to ldiskfs journal updates. Journaling is widely used by modern file systems to increase robustness against metadata corruption and to minimize file system recovery times after a crash. Lustre uses ldiskfs (a modified version of ext3) as the backend file system on the OST, MDT and MGT devices. Similar to ext3, the ldiskfs file system journals only metadata updates, by first writing the data blocks to disk, followed by writing the metadata blocks to the journal. The journal is then written to disk and the data is marked as committed. In the worst case this can result in two 4KB writes (and two head syncs) with every 1MB write. Due to the poor IOP performance of SATA disks, these additional head syncs and small writes substantially degraded performance.

To alleviate the journaling overhead, we evaluated two potential solutions. The first was to move the file system journal to an external block device. The RamSan-400 was selected as a potential candidate for an external journal device. The RamSan-400 is an InfiniBand connected storage system which utilizes DRAM as the storage media. This provides an extremely high IOP solution, and due to our relatively modest storage requirement of approximately 400MB per OST device, the RamSan was a potential solution. By using the RamSan as an external journal device we were able to isolate all 4KB journal traffic to the high IOP storage device while allowing sequential data blocks to flow to the low IOP SATA disks. By utilizing external journals we were able to achieve 3292.6 MB/sec, or 60% of our baseline performance.

A second solution was later proposed to decrease the impact of journal updates. Rather than opening the journal, writing data blocks, and then closing the journal on every write, ldiskfs was modified to update the journal asynchronously after a potentially large number of writes. Utilizing this approach resulted in dramatically fewer 4KB updates (and fewer head seeks), which substantially improved performance to over 4625 MB/s, or 85% of our baseline performance.

3.4 Network Congestion Control

After improving Lustre level performance by over 330% on a single DDN S2A9900, we began to focus our efforts on examining performance at scale utilizing half of the available Spider storage and Jaguar XT5. Initially we configured the storage as a dedicated file system on Jaguar XT5. In this configuration each Cray SIO node acts as a Lustre OSS as opposed to a Lustre router. This allowed us to work through a number of stability and performance issues within Lustre without the additional complexity of a routed configuration. To baseline performance of the storage system from Jaguar XT5 we used XDD run from the Cray SIO nodes. In order to minimize network congestion on the IB fabric, each Cray SIO node was paired with its associated DDN controller on an IB line card on the core switch. Each line card is a 24-port crossbar; by pairing the SIO node with its associated DDN controller on a crossbar, congestion in the IB fat-tree is eliminated. This configuration allows each SIO node to achieve maximum bandwidth irrespective of the number of communicating peers, which we verified by running XDD on 48 to 96 Cray SIO nodes in parallel. Performance scaled linearly.

After baselining performance of the storage system from the Cray SIO nodes, we then conducted a series of tests using IOR from the Cray compute nodes. Initial results showed high variability in performance between successive runs, even on a quiesced system. It was also noted that read performance was always substantially lower than write performance. Our suspicion was that placement of objects on OSTs relative to the corresponding client's location in the torus could result in pathological cases of torus congestion. That is to say, a Lustre client on a compute node may allocate objects on an OST that is topologically far away. In addition, it was noted that the dramatic performance difference between reads and writes may also be a result of torus congestion, as the data paths over the network are not symmetric for read and write operations.

To test our theory that SeaStar2+ network congestion was substantially impacting performance of the file system, a mechanism was devised that allowed clients to allocate objects on OSTs that are topologically near the client. Performance was then measured using IOR with each client writing to a single file on an OST that was topologically near. Performance improved substantially, as illustrated in Figure 7. In Figure 7, "default read" and "default write" performance was obtained using the IOR benchmark with the default object allocation policy of Lustre. The performance results of both "placed read" and "placed write" were obtained using IOR and preallocating files on OSTs topologically near the client writing to each file.

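Returning to the journaling behavior described in Section 3.3, a rough service-time model shows why a few synchronous 4KB journal commits per 1MB data write are so damaging on SATA drives. The streaming rate and seek penalty below are assumed, typical values rather than measurements from the paper, so the output is only indicative of the effect.

```python
# Back-of-the-envelope model of journal-commit overhead on a SATA drive.
# Assumed (illustrative) drive characteristics -- not from the paper:
STREAM_MB_PER_S = 70.0   # sustained sequential rate of one SATA drive
SEEK_SYNC_S = 0.010      # cost of a head seek plus a small synchronous write

def effective_rate(data_mb: float, small_sync_writes: int) -> float:
    """Effective MB/s when each data_mb write incurs extra synchronous I/O."""
    service_time = data_mb / STREAM_MB_PER_S + small_sync_writes * SEEK_SYNC_S
    return data_mb / service_time

baseline = effective_rate(1.0, 0)   # pure streaming
journaled = effective_rate(1.0, 2)  # worst case: 2 extra 4KB journal commits

print(f"streaming only: {baseline:.1f} MB/s")
print(f"with 2 sync commits per 1MB: {journaled:.1f} MB/s "
      f"({journaled / baseline:.0%} of streaming)")
```

Both remedies described in Section 3.3 attack the second term of this model: an external DRAM journal makes the synchronous commits nearly free, while asynchronous journal commits amortize them over many writes.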
Figure 7: Performance on Jaguar XT5 (placed vs. non-placed: aggregate bandwidth in MBytes/s versus number of OSSes for default and placed IOR reads and writes, compared with XDD read and write baselines)

Figure 8: Lustre Fine Grain Routing (clients, routers, and OSSes divided into two groups on the SeaStar network, each group preferring its own routers across the InfiniBand network via IB line cards)

Having demonstrated that network congestion can severely impact aggregate file system performance when the Cray SIO nodes were configured as Lustre OSSes, we then began tackling this problem in the context of the routed Spider configuration. Lustre clients and servers spread load amongst routers based on queue depths on the router. Unfortunately, there is no mechanism to detect congestion between a client and a router, or between a server and a router, and to prefer routers that minimize network congestion. Recognizing this as a severe architectural limitation, we began working with Sun to provide a mechanism for clients and servers to prefer specific routers. This mechanism would allow us to pair clients with routers within the SeaStar2+ torus in order to minimize congestion within the torus. In essence, we can view a set of 32 routers as a replicated resource providing access to every OST in the file system such that congestion in the IB network is minimized. With 192 total routers and 32 routers in each set, we have 6 replicated resources within the SeaStar torus. By grouping these 32 routers appropriately we can then assign clients to these routers such that communication is localized to a sub 3-D mesh of the torus. This strategy should reduce contention in the SeaStar2+ torus based on our previous results on OST placement. Work is ongoing to validate this theory. Figure 8 illustrates this concept using two routing groups on the SeaStar2+ network, each with two routers. Lustre clients in the first group, indicated by the color green, will prefer the routers indicated by the color yellow. The yellow routers can access all storage via an IB crossbar without resorting to traversing the IB fat-tree. In a similar fashion, clients in blue can utilize the red routers to access any storage. Contention on both the SeaStar2+ network and the IB network is therefore minimized.

3.5 Scalability

In order to verify the stability of the Spider file system at full scale, testing was conducted using 4 major systems at the LCF: the Jaguar XT5 and Jaguar XT4 partitions, Lens, and Smoky. All systems were configured to mount Spider concurrently, equating to over 26,000 Lustre clients and over 180,000 processing cores. To our knowledge this is the largest number of clients to mount a single Lustre file system to date. In conducting this testing a number of issues were revealed. As the number of clients mounting the file system increased, the memory footprint of Lustre grew at an unsustainable rate. As the memory footprint on the OSSes grew past 11GB we began to get out-of-memory errors (OOMs) on the OSS nodes. By analyzing the memory allocations from Lustre, it was discovered that an extremely large number of 64KB buffers were being allocated. Reviewing the Lustre code base revealed that 40KB memory allocations were made, one for each client-OST connection, which resulted in 64KB memory allocations within the kernel. With 7 OSTs per OSS and over 26,000 clients this equated to 26000 * 7 * 64KB = 11.1GB of memory per OSS for server side client statistics alone. As client statistics are also stored on the client (a much more scalable arrangement, as each client would only store 7 * 64KB = 448KB in our configuration), we removed the server side statistics entirely. Figure 9 illustrates the server side memory footprint as a function of the number of clients in our initial configuration.
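The server-side memory growth above is simple arithmetic, reproduced in the sketch below using only the numbers given in this section:

```python
# Server-side vs client-side cost of per-connection client statistics,
# using the figures quoted in Section 3.5.

CLIENTS = 26_000
OSTS_PER_OSS = 7
ALLOC_KB = 64   # kernel allocation charged per client-OST connection

server_side_gb = CLIENTS * OSTS_PER_OSS * ALLOC_KB / (1024 * 1024)
client_side_kb = OSTS_PER_OSS * ALLOC_KB

print(f"per-OSS footprint with server-side stats: ~{server_side_gb:.1f} GB")
print(f"per-client footprint with client-side stats: {client_side_kb} KB")
```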

Figure 9: Memory footprint of client statistics (client statistics memory footprint on an OSS, in MBytes, versus the number of clients)

Figure 10: Impact of Jaguar XT4 Reboot (hard bounce of 7844 nodes via 48 routers; percent of observed peak combined read/write MB/s and IOPS versus elapsed time, annotated with the XT4 bounce at 206s, RDMA timeouts, bulk timeouts, OST evictions, I/O returning at 435s, and full I/O at 524s)

3.6 Fault Tolerance

As Spider is a center wide resource, much of our testing at full scale centered on surviving component failures. A major concern was the impact of an unscheduled outage of a major computational resource on the Spider file system. To test this impact we mounted the Spider file system on all the compute nodes spanning the Jaguar XT4 partition, Lens, and Smoky. With an I/O workload active on Jaguar XT4, Smoky, and Lens, the Jaguar XT4 system was rebooted. Shortly thereafter the file system became unresponsive. Postmortem analysis of this event showed that the OSSes spent a substantial amount of time processing client evictions. Using the DDN S2A9900 performance analysis tools we observed a large number of small writes to each OST during this eviction processing, with very little other I/O progressing on the system. This was later tracked down to a synchronous write to each OST for each evicted client, resulting in a backlog of client I/O from Lens and Smoky. Changing the client eviction code to use an asynchronous write resolved this problem and in later testing allowed us to demonstrate the file system's ability to withstand a reboot of either Jaguar XT4 or Jaguar XT5 with minimal impact to other systems with active I/O. Figure 10 illustrates the impact of a reboot of Jaguar XT4 with active I/O on Jaguar XT5, Lens and Smoky. The y-axis shows the percentage of peak aggregate performance throughout the experiment. At 206 seconds elapsed time Jaguar XT4 is rebooted. RPCs time out and performance of the Spider file system degrades substantially for approximately 290 seconds. Aggregate bandwidth improves at 435 seconds and steadily increases until we hit the new steady state performance at 524 seconds. Aggregate performance does not return to 100% due to the mixed workload on the systems and the absence of the Jaguar XT4 I/O load.

4 Conclusions

In collaboration with Cray, Sun, and DDN, the Technology Integration group at ORNL has successfully architected, integrated, and deployed a center wide file system capable of supporting over 26,000 clients and delivering in excess of 200 GB/sec of file system bandwidth. To achieve this goal a number of unique technical challenges were met. Designing a system of this magnitude required careful analysis of failure scenarios, fault tolerance mechanisms to deal with these failures, scalability of system software and hardware components, and overall system performance. Through a phased approach of deploying and evaluating prototype systems, followed by deployment of a large scale dedicated file system and a transition to the Spider file system, we have delivered one of the world's highest performance file systems. The Spider file system is now in limited access production use at the LCF by "early science" projects on the Jaguar XT5 partition. Spider will transition to full production in the Summer of 2009.

5 Future Work

While a large number of technical challenges have been addressed during the Spider project, a number still remain. Performance degradation during an unscheduled system outage may last for up to 5 minutes as detailed in Section 3. Technology Integration is working closely with file system engineers in the LCE in order to minimize or eliminate this degradation entirely. Fine grain routing in the LNET layer is currently accomplished by creating multiple distinct LNET networks.

We have begun to test this approach and have had some success, although the complexity of the configuration is daunting. Rather than using separate LNET networks to achieve fine grained routing, a routing table approach may ease the complexity of configuration while giving tighter control over individual routes. This would require significant changes within LNET, and as such each approach must be evaluated carefully.
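Whichever mechanism is ultimately adopted, the mapping it has to express is the router grouping described in Section 3.4. A toy sketch of that mapping is shown below; the grouping rule and torus coordinates are hypothetical, and the real assignment depends on the SeaStar torus layout and the LNET configuration in use.

```python
# Toy sketch of the router grouping behind fine-grained routing (Section 3.4):
# 192 routers form 6 replicated sets of 32, and each client prefers the set
# that is topologically nearest. The rules below are illustrative only.

ROUTERS = 192
GROUPS = 6
GROUP_SIZE = ROUTERS // GROUPS   # 32 routers per replicated set

def router_group(router_id):
    """Toy rule: consecutive router IDs belong to the same set."""
    return router_id // GROUP_SIZE

def preferred_group(client_x, client_y, client_z):
    """Toy rule: clients in the same Z slab of the torus share a router set."""
    return client_z % GROUPS

print("router 70 is in group", router_group(70))                      # group 2
print("client at (3, 7, 4) prefers group", preferred_group(3, 7, 4))  # group 4
```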

References

[1] W. Allcock, J. Bresnahan, R. Kettimuthu, and M. Link. The Globus striped GridFTP framework and server. In SC '05: Proceedings of the 2005 ACM/IEEE Conference on Supercomputing, page 54, Washington, DC, USA, 2005. IEEE Computer Society.

[2] A. Bland, R. Kendall, D. Kothe, J. Rogers, and G. Shipman. Jaguar: The world's most powerful computer. In Proceedings of the Cray User Group Conference, 2009.

[3] G. A. Gibson and D. A. Patterson. Designing disk arrays for high data reliability. Journal of Parallel and Distributed Computing, 17(1-2):4-27, 1993.

[4] Sun Microsystems, Inc. Lustre wiki. http://wiki.lustre.org, 2009.

[5] InfiniBand Trade Association. InfiniBand Architecture Specification, Vol. 1, Release 1.2, 2004.

[6] M. Minich. InfiniBand Based Cable Comparison, June 2007.

[7] D. A. Patterson, G. A. Gibson, and R. H. Katz. A case for redundant arrays of inexpensive disks (RAID). Technical report, Berkeley, CA, USA, 1987.

[8] T. Ruwart. XDD. http://www.ioperformance.com/, 2009.

[9] Sandia National Laboratories. Thunderbird Linux Cluster ranks 6th in Top500 supercomputing race. http://www.sandia.gov/news/resources/releases/2006/thunderbird.html.

[10] H. Shan and J. Shalf. Using IOR to analyze the I/O performance of XT3. In Proceedings of the 49th Cray User Group (CUG) Conference, Seattle, WA, 2007.

[11] The HPSS Collaboration. HPSS. http://www.hpss-collaboration.org/.
