End-to-end I/O Portfolio for the Summit Supercomputing Ecosystem

Sarp Oral, Sudharshan S. Vazhkudai, Feiyi Wang, Christopher Zimmer, Christopher Brumgard, Jesse Hanley, George Markomanolis, Ross Miller, Dustin Leverman, Scott Atchley, and Veronica Vergara Larrea

ORNL is managed by UT-Battelle, LLC for the US Department of Energy

ORNL Summit System Overview

System Performance
• Peak of 200 Petaflops (FP64) for modeling & simulation
• Peak of 3.3 ExaOps (FP16) for data analytics and artificial intelligence

The system includes
• 4,608 nodes
• Dual-rail Mellanox EDR InfiniBand network
• 250 PB IBM file system transferring data at 2.5 TB/s

Each node has
• 2 IBM POWER9 processors
• 6 V100 GPUs
• 608 GB of fast memory (96 GB HBM2 + 512 GB DDR4)
• 1.6 TB of non-volatile memory

Summit Node Schematic
• Coherent memory across entire node
• NVLink v2 fully interconnects three GPUs and one CPU on each side of node
• PCIe Gen 4 connects NVM and NIC
• Single shared NIC with dual EDR ports

[Node schematic diagram: each GPU provides 7 TF and 16 GB HBM at 900 GB/s; NVLink links at 50 GB/s; each P9 socket has 256 GB DDR4 at 135 GB/s; X-Bus (SMP) at 64 GB/s; PCIe Gen4 at 16 GB/s to NVM and NIC; NVM at 6.0 GB/s read / 2.2 GB/s write; NIC with two 12.5 GB/s EDR ports]

Node totals: 42 TF (6x7 TF), 96 GB HBM (6x16 GB), 512 GB DRAM (2x16x16 GB), 25 GB/s NET (2x12.5 GB/s), 83 MMsg/s.
HBM & DRAM speeds are aggregate (Read+Write). All other speeds (X-Bus, NVLink, PCIe, IB) are bi-directional.

OLCF I/O Requirements

• POSIX API
  – ~1,300 users access and request POSIX
• Usability and transparency
  – No data islands, center-wide access, and no application code changes
• Capacity
  – O(100)s PB; min 30x of Summit CPU memory; 80x desired
• I/O performance
  – O(TB/s); 50% of total CPU memory to be written in 5 minutes (see the worked estimate below)
• N-N and N-1 I/O support
• ML workload support
  – High small-read IOPS
• Cost efficient design
  – Not to exceed 15% of the Summit cost

Note: OLCF workloads do not distinguish between checkpoints and data outputs, and persist all output data.
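As a rough, hedged reading of the I/O performance requirement above: if "CPU memory" is taken to mean the 512 GB of DDR4 per node (an assumption; the deck elsewhere also quotes 2.8 PB of DDR+HBM), then writing half of it in five minutes needs roughly

\[
\frac{0.5 \times (4{,}608 \times 512\ \mathrm{GB})}{300\ \mathrm{s}} \approx \frac{1.18\ \mathrm{PB}}{300\ \mathrm{s}} \approx 3.9\ \mathrm{TB/s},
\]

which is more than the 2.5 TB/s the center-wide PFS alone delivers, and helps explain the 9.7 TB/s in-system storage layer introduced later in the deck.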

OLCF I/O solutions architecture

[Architecture diagram]
• Summit compute nodes (4,608): 2 P9 CPUs and 6 GPUs per node, with a node-local NVMe SSD (/XFS) attached via PCIe Gen4
• In-system storage layer: 1.6 TB NVMe per node; NVM rates 5.8 GB/s read, 2.1 GB/s write, 1 Million read IOPS, 170K write IOPS; aggregate capacity 7.4 PB; aggregate write BW 9.7 TB/s; exposed via /SymphonyFS (FUSE-based distributed FS) and the Spectral intercept library
• Mellanox InfiniBand EDR fat-tree network: 2.5 TB/s
• /Spider3 center-wide PFS: 77 IBM Spectrum Scale GL4s, each with NSD I/O server pairs and two redundancy groups of 211 HDDs; capacity 250 PB; BW 2.5 TB/s; also connected to other OLCF systems

Design challenges

• Not possible to order from a catalogue
  – GPFS at the time of the selection (<2014)
    • Largest deployment was 18 PB
    • Highest throughput was 400 GB/s
    • Highest create/s per single directory was 5
  – Multi-year collaboration with IBM
    • Identify performance bottlenecks
    • Verify design and implementation
    • Deployment and acceptance issues
• Provide a transparent mount point
  – Users don’t like multiple mount points
  – Users don’t like searching data in multiple places
• Need a solution combining node-local and center-wide POSIX mount points
• Need a solution moving data from node-local mount points to GPFS transparently

Center-wide parallel file system (PFS)

• Spider 3/Alpine
  – POSIX namespace, shared
  – IBM SpectrumScale/GPFS
    • 77 ESS GL4, w/ O(30K) 10TB NL-SAS drives
    • IB EDR connected
  – 250 PB usable, formatted
    • ~90x of 2.8 PB DDR+HBM of Summit
  – 2.5 TB/s aggregate sequential write/read
  – 2.2 TB/s aggregate random write/read
  – 800K/s 32KB file transactions
    • create/open+write+close
  – ~30K 0B file creates in a shared directory (see the sketch below)
• Each GL4
  – 2 P9-based NSD servers
  – 4 106-slot disk enclosures
  – 12 Gbps SAS connected (NSD – enclosure)
  – 422 disks in total, organized in 2 distributed RAID sets
• Each NSD
  – 2 IB ConnectX-5 EDR ports connect to Summit
  – 2 IB ConnectX-5 EDR ports connect to the rest of OLCF
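For context on the metadata figures above, a create-rate number of this kind is usually gathered with a small metadata microbenchmark: many clients create empty files in one shared directory and the aggregate create rate is reported. The sketch below is illustrative only (the directory path, file count, and single-process structure are assumptions, not the benchmark behind the slide's numbers), but it shows the create/open+write+close pattern being counted.

/* Illustrative shared-directory create-rate microbenchmark (assumption:
 * this is NOT OLCF's acceptance benchmark). Each process creates many
 * 0-byte files in one shared directory and reports its create rate. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const char *dir = (argc > 1) ? argv[1] : "/gpfs/alpine/scratch/shared_dir"; /* hypothetical path */
    long nfiles = (argc > 2) ? atol(argv[2]) : 10000;
    char path[4096];
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < nfiles; i++) {
        /* create/open + (no data) + close: one 0-byte file create */
        snprintf(path, sizeof(path), "%s/f.%d.%ld", dir, (int)getpid(), i);
        int fd = open(path, O_CREAT | O_WRONLY | O_EXCL, 0644);
        if (fd < 0) { perror("open"); return 1; }
        close(fd);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%ld creates in %.2f s -> %.0f creates/s\n", nfiles, secs, nfiles / secs);
    return 0;
}

Real measurements run this loop concurrently from many ranks on many nodes against the same directory, which is exactly what stresses the file system's shared-directory locking.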

Summit in-system storage layer

• Each Summit compute node can write @ 12.5 GB/s to Alpine
  – Max out Alpine w/ 200 Summit compute nodes (200 × 12.5 GB/s = 2.5 TB/s)
• Each Summit node has a 1.6 TB Samsung PM1725a NVMe, exclusive
  – 6 GB/s read and 2.1 GB/s write I/O performance
  – 5 drive writes per day (DWPD)
  – Formatted as XFS (node-local file system)
• In aggregate, the Summit in-system storage layer provides
  – 7.4 PB (4,608 nodes × 1.6 TB) @ 26.7 TB/s read and 9.7 TB/s write I/O performance
  – 4.6 billion IOPS in aggregate
  – 2.5 times the capacity of aggregate system DRAM and HBM
• The question is how to aggregate and present this in software as an effective I/O solution

Spectral

• An I/O intercept library interposing between the application and Spider 3 – https://code.ornl.gov/cz7/Spectral

• No application code changes required; supports N-N and N-1 workloads

• All writes are redirected to local XFS and later transferred to a location on Spider 3 specified by an environment variable in the job script (see the sketch after this list)

• Does not add extra metadata on the I/O critical path

• Detects the close() of the file handle and enqueues the data transfer to the data mover process that runs on the isolated system cores

• Maintains a log of files in motion; on application exit, a call to the Spectral wait tool holds the job open until the remaining files are transferred

• Feature complete; code hardening for operations
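As a rough illustration of the interception idea described above (this is not Spectral's source; the redirect policy, helper names, and reliance on PERSIST_DIR are assumptions for the sketch), an LD_PRELOAD-style shim can rewrite the path of newly created output files to the node-local XFS directory on open() and hand finished files to a data mover on close().

/* Illustrative LD_PRELOAD-style write-redirect shim -- NOT Spectral's source.
 * Compile: gcc -shared -fPIC -o shim.so shim.c -ldl
 * PERSIST_DIR names the node-local XFS directory (assumption for this sketch). */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <fcntl.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

static int (*real_open)(const char *, int, ...);
static int (*real_close)(int);

/* Rewrite an output path so it lands under the node-local XFS directory. */
static const char *redirect(const char *path, char *buf, size_t len)
{
    const char *persist = getenv("PERSIST_DIR");
    if (!persist)
        return path;                                   /* shim effectively disabled */
    const char *base = strrchr(path, '/');
    snprintf(buf, len, "%s/%s", persist, base ? base + 1 : path);
    return buf;
}

int open(const char *path, int flags, ...)
{
    char buf[4096];
    if (!real_open)
        real_open = (int (*)(const char *, int, ...))dlsym(RTLD_NEXT, "open");
    if (flags & O_CREAT) {                             /* redirect newly created output files */
        va_list ap;
        va_start(ap, flags);
        mode_t mode = (mode_t)va_arg(ap, int);
        va_end(ap);
        return real_open(redirect(path, buf, sizeof(buf)), flags, mode);
    }
    return real_open(path, flags);
}

int close(int fd)
{
    if (!real_close)
        real_close = (int (*)(int))dlsym(RTLD_NEXT, "close");
    /* A real intercept library would record the finished file here and
     * enqueue it with a data-mover daemon so it is drained to the PFS. */
    return real_close(fd);
}

In practice the real library also tracks which descriptors it redirected, logs files in motion, and runs its data mover on the isolated system cores; the sketch only shows the redirect-on-open / enqueue-on-close structure.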

SymphonyFS

• FUSE-based file system presenting a unified namespace over the distributed node-local XFS filesystems and Spider 3 – https://code.ornl.gov/techint/SymphonyFS

• Provides a single mount point combining node-local and PFS namespaces

• No application code changes required; supports N-N and N-1 workloads

• Metadata and read calls are directed to Spider 3; writes are directed to local XFS – Relies on Spider 3 for most metadata operations, which adds latency as they transit FUSE

• The SymphonyFS daemon then transfers the data from the local XFS namespace to Spider 3 through the compute node's local GPFS client

• Applications must either avoid read-after-write and overlapping writes between nodes (not very common use cases for OLCF) or employ manual intervention with mechanisms such as fsync() (see the sketch after this list)

• In development, improving metadata performance
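A minimal sketch of the manual intervention mentioned in the read-after-write bullet, assuming a hypothetical SymphonyFS mount at /symphonyfs (the path, sizes, and file names are illustrative): the writer fsync()s and closes its output before signaling any reader on another node, since writes are buffered on the node-local NVMe until drained.

/* Writer-side sketch for a write-then-read-elsewhere workflow on a
 * FUSE-backed write-back mount. The mount point /symphonyfs and the
 * file name are assumptions for illustration. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/symphonyfs/run42/checkpoint.dat";  /* hypothetical path */
    static char block[1 << 20];
    memset(block, 0xAB, sizeof(block));

    int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    for (int i = 0; i < 512; i++) {                 /* write 512 MiB of checkpoint data */
        if (write(fd, block, sizeof(block)) != (ssize_t)sizeof(block)) {
            perror("write"); return 1;
        }
    }

    /* Force the buffered data out to the backing store before any other
     * node reads this file; without this, a reader on a different node
     * may see stale or missing data. */
    if (fsync(fd) != 0) { perror("fsync"); return 1; }
    close(fd);

    /* Only now signal the consumer (e.g., create a "done" marker file). */
    return 0;
}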

Spectral and SymphonyFS workflow

[Workflow diagram: on each compute node, applications write through either the Spectral intercept library or the SymphonyFS mount (1); data lands on the node-local NVMe (2); and is then drained to GPFS, which provides the unified namespace (3)]

Spectral basic use case

• Needs four easy changes to the application configuration (see the illustrative snippet below)
  – An update in the configuration file to tell the application to write to a path on the local XFS filesystem on the node-local NVMe SSD
  – Setting PERSIST_DIR, the base directory location on local XFS where the application will write these files
  – Setting a location for Spectral to move the files to on Spider 3, noted as PFS_DIR
  – Notifying the runtime system, using the module and alloc_flags parameters, to load the Spectral libraries
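A hypothetical illustration of the application-side part of these changes (PERSIST_DIR and PFS_DIR are the variables named above; the file name, fallback path, and rank handling are assumptions): the application simply writes its output under PERSIST_DIR and relies on Spectral to drain the finished file to PFS_DIR on Spider 3.

/* Hypothetical application-side change for the Spectral use case:
 * write output under PERSIST_DIR (node-local XFS) instead of the PFS.
 * Spectral later drains finished files to PFS_DIR on Spider 3. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const char *persist = getenv("PERSIST_DIR");     /* node-local XFS base dir */
    char path[4096];

    /* Fall back to the current directory if Spectral is not configured. */
    snprintf(path, sizeof(path), "%s/output.%d.dat",
             persist ? persist : ".", 0 /* rank id would go here */);

    FILE *out = fopen(path, "w");
    if (!out) { perror("fopen"); return 1; }
    fprintf(out, "checkpoint payload...\n");         /* application output */
    fclose(out);                                     /* close() triggers the drain */
    return 0;
}

The remaining changes, loading the Spectral module and setting alloc_flags in the job script, happen outside the application and are not shown here.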

OLCF end-to-end I/O solutions

Table 1: End-to-end I/O solutions for Summit and its ecosystem. Spectral and SymphonyFS reflect application-level performance. The Spectral I/O library performs at the rate of the aggregate local XFS filesystems. SymphonyFS performance numbers are projections based on small-scale test results and reflect N-1 I/O workloads with large block sizes.

Solution    | POSIX     | Usability | Capacity (aggregate) | Performance (aggregate)               | Supported I/O modes | ML/DL support
Spider 3    | Compliant | High      | 250 PB               | 2.5 TB/s for reads and writes         | N-N and N-1         | Low
Local XFS   | Compliant | Low       | 7.4 PB (1.6 TB/node) | 24.8 TB/s for reads, 9.6 TB/s writes  | N-N                 | High
Spectral    | N/A       | Medium    | 7.4 PB               | 24.8 TB/s for reads, 9.6 TB/s writes  | N-N                 | High
SymphonyFS  | Compliant | High      | 7.4 PB               | 2.5 TB/s for reads, 9.6 TB/s writes   | N-N and N-1         | High

Select performance evaluations – Core isolation

• On Summit compute nodes, the GPFS client is limited to using a few cores
  – On NAMD, the core isolation (CI) impact is evident
  – Minimizes jitter and maximizes parallel application performance
  – Summit is tuned for parallel application performance
  – SMT-4
    • 4 hardware threads per core

[Chart: NAMD run time per step (seconds, 0.000-0.020) with_CI vs. without_CI, over sampling steps 0-11 and their average]

Select performance evaluations – NVMe vs. Spider 3

• Single node 4KB random read performance
  – PM1725a NVMe is rated for 1M 4KB IOPS
  – At low queue depths, NVMe can achieve double the performance of Spider 3
  – At high queue depths, NVMe can easily provide >700K 4KB read IOPS
  – (A simple measurement sketch follows the chart.)

[Chart: 4KB random read IOPS (0-900K) at queue depths 1, 2, 4, 8, 16, 32, 64, for NVMe and GPFS with 1, 3, 6, and 9 concurrent jobs]
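The shape of such a comparison can be reproduced with a small random-read loop like the hedged sketch below (single-threaded, effectively queue depth 1; the file path and iteration count are assumptions, and the slide's numbers come from sweeping queue depths with a proper benchmark, not from this code).

/* Illustrative single-threaded 4KB random-read loop (queue depth 1).
 * The file path and counts are assumptions; this is not the benchmark
 * used for the slide's numbers. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const char *path = (argc > 1) ? argv[1] : "/mnt/bb/testfile";  /* hypothetical */
    const size_t bs = 4096;
    const long iters = 100000;
    char *buf;
    if (posix_memalign((void **)&buf, 4096, bs)) return 1;

    int fd = open(path, O_RDONLY | O_DIRECT);       /* O_DIRECT to bypass the page cache */
    if (fd < 0) { perror("open"); return 1; }
    off_t size = lseek(fd, 0, SEEK_END);
    long nblocks = size / (off_t)bs;
    if (nblocks <= 0) { fprintf(stderr, "file too small\n"); return 1; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    srand(42);
    for (long i = 0; i < iters; i++) {
        off_t off = (off_t)(rand() % nblocks) * bs; /* random 4KB-aligned offset */
        if (pread(fd, buf, bs, off) != (ssize_t)bs) { perror("pread"); return 1; }
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%ld reads in %.2f s -> %.0f IOPS\n", iters, secs, iters / secs);
    close(fd);
    free(buf);
    return 0;
}

Higher queue depths require an asynchronous engine (for example many threads, or libaio/io_uring) keeping several such reads in flight at once.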

Select performance evaluations – Spectral vs. Spider 3

• At-scale performance testing, 4,096 Summit nodes
• IOR, 6 ranks/node, write 1 GB/rank, 1 MB I/O size, N-N/FPP
• Tremendous reduction in time spent in I/O at scale
  – Metadata is not passed down to Spider 3

Table 2: Measured Spectral and GPFS performance on 4,096 nodes of Summit.

System    | Bandwidth (MiB/s) | Open (s) | Write (s) | Close (s)
Spectral  | 8,340,459         | 0.001    | 3.02      | 0.203
Spider 3  | 35,787            | 582.03   | 694.92    | 637.62

Select performance evaluations – SymphonyFS vs. Spider 3

• SymphonyFS small-scale write test using IOR
  – vs. local XFS and one GL4
  – Comparable results to local XFS
  – Not including the last checkpoint
• Metadata impact
  – Small I/O hurts FUSE/SymphonyFS performance
  – SymphonyFS masks the latency of transferring the data to Spider 3 while the next phase is computing
    • The last-phase penalty cannot be masked

Conclusions

• Cost-effective I/O subsystem design supporting multiple competing requirements
• In-house developed I/O solutions for the in-system storage layer to support N-N, N-1, and emerging ML workloads
  – Linear performance scaling, high IOPS
• A high-performance, large-capacity, center-wide scratch POSIX namespace supporting all OLCF systems and resources

Acknowledgements

This work was performed under the auspices of the U.S. DOE by the Oak Ridge Leadership Computing Facility at ORNL under contract DE-AC05-00OR22725.

Comparison of Titan, Summit, and Frontier Systems

System Specs            | Titan                                        | Summit                                          | Frontier
Peak                    | 27 PF                                        | 200 PF                                          | ~1.5 EF
# cabinets              | 200                                          | 256                                             | > 100
Node                    | 1 AMD Opteron CPU, 1 NVIDIA K20X Kepler GPU  | 2 IBM POWER9 CPUs, 6 NVIDIA Volta GPUs          | 1 HPC and AI optimized AMD EPYC CPU, 4 purpose-built AMD Radeon Instinct GPUs
On-node interconnect    | PCI Gen2, no coherence across the node       | NVIDIA NVLINK, coherent memory across the node  | AMD Infinity Fabric, coherent memory across the node
System interconnect     | Cray Gemini network, 6.4 GB/s                | Mellanox dual-port EDR IB network, 25 GB/s      | Cray four-port Slingshot network, 100 GB/s
Topology                | 3D Torus                                     | Non-blocking Fat Tree                           | Dragonfly
Storage filesystem      | 32 PB, 1 TB/s, Lustre                        | 250 PB, 2.5 TB/s, IBM Spectrum Scale with GPFS  | 2-4x performance and capacity of Summit's I/O subsystem
Near-node NVM (storage) | No                                           | Yes                                             | Yes
