End-to-end I/O Portfolio for the Summit Supercomputing Ecosystem

Sarp Oral, Sudharshan S. Vazhkudai, Feiyi Wang, Christopher Zimmer, Christopher Brumgard, Jesse Hanley, George Markomanolis, Ross Miller, Dustin Leverman, Scott Atchley, and Veronica Vergara Larrea

ORNL is managed by UT-Battelle, LLC for the US Department of Energy

ORNL Summit System Overview

System Performance
• Peak of 200 Petaflops (FP64) for modeling & simulation
• Peak of 3.3 ExaOps (FP16) for data analytics and artificial intelligence

The system includes
• 4,608 nodes
• Dual-rail Mellanox EDR InfiniBand network
• 250 PB IBM file system transferring data at 2.5 TB/s

Each node has
• 2 IBM POWER9 processors
• 6 V100 GPUs
• 608 GB of fast memory (96 GB HBM2 + 512 GB DDR4)
• 1.6 TB of non-volatile memory

Summit Node Schematic
• Coherent memory across entire node
• NVLink v2 fully interconnects three GPUs and one CPU on each side of node
• PCIe Gen 4 connects NVM and NIC
• Single shared NIC with dual EDR ports

[Node schematic diagram: each GPU provides 7 TF and 16 GB HBM at 900 GB/s; NVLink links at 50 GB/s; each P9 socket has 256 GB DDR4 at 135 GB/s; X-Bus (SMP) at 64 GB/s; PCIe Gen4 at 16 GB/s to NVM and NIC; NVM at 6.0 GB/s read / 2.2 GB/s write; NIC with two 12.5 GB/s EDR ports]

Node totals: 42 TF (6x7 TF), 96 GB HBM (6x16 GB), 512 GB DRAM (2x16x16 GB), 25 GB/s NET (2x12.5 GB/s), 83 MMsg/s.
HBM & DRAM speeds are aggregate (Read+Write). All other speeds (X-Bus, NVLink, PCIe, IB) are bi-directional.

OLCF I/O Requirements

• POSIX API
  – ~1,300 users access and request POSIX
• Usability and transparency
  – No data islands, center-wide access, and no application code changes
• Capacity
  – O(100)s PB; min 30x of Summit CPU memory; 80x desired
• I/O performance
  – O(TB/s); 50% of total CPU memory to be written in 5 minutes (see the worked estimate below)
• N-N and N-1 I/O support
• ML workload support
  – High small-read IOPS
• Cost efficient design
  – Not to exceed 15% of the Summit cost

Note: OLCF workloads do not distinguish between checkpoints and data outputs, and persist all output data.
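As a rough, hedged reading of the I/O performance requirement above: if "CPU memory" is taken to mean the 512 GB of DDR4 per node (an assumption; the deck elsewhere also quotes 2.8 PB of DDR+HBM), then writing half of it in five minutes needs roughly

\[
\frac{0.5 \times (4{,}608 \times 512\ \mathrm{GB})}{300\ \mathrm{s}} \approx \frac{1.18\ \mathrm{PB}}{300\ \mathrm{s}} \approx 3.9\ \mathrm{TB/s},
\]

which is more than the 2.5 TB/s the center-wide PFS alone delivers, and helps explain the 9.7 TB/s in-system storage layer introduced later in the deck.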

OLCF I/O solutions architecture

[Architecture diagram]
• Summit compute nodes (4,608): 2 P9 CPUs and 6 GPUs per node, with a node-local NVMe SSD (/XFS) attached via PCIe Gen4
• In-system storage layer: 1.6 TB NVMe per node; NVM rates 5.8 GB/s read, 2.1 GB/s write, 1 Million read IOPS, 170K write IOPS; aggregate capacity 7.4 PB; aggregate write BW 9.7 TB/s; exposed via /SymphonyFS (FUSE-based distributed FS) and the Spectral intercept library
• Mellanox InfiniBand EDR fat-tree network: 2.5 TB/s
• /Spider3 center-wide PFS: 77 IBM Spectrum Scale GL4s, each with NSD I/O server pairs and two redundancy groups of 211 HDDs; capacity 250 PB; BW 2.5 TB/s; also connected to other OLCF systems

Design challenges

• Not possible to order from a catalogue
  – GPFS at the time of the selection (<2014)
    • Largest deployment was 18 PB
    • Highest throughput was 400 GB/s
    • Highest create/s per single directory was 5
  – Multi-year collaboration with IBM
    • Identify performance bottlenecks
    • Verify design and implementation
    • Deployment and acceptance issues
• Provide a transparent mount point
  – Users don’t like multiple mount points
  – Users don’t like searching data in multiple places
• Need a solution combining node-local and center-wide POSIX mount points
• Need a solution moving data from node-local mount points to GPFS transparently

Center-wide parallel file system (PFS)

• Spider 3/Alpine
  – POSIX namespace, shared
  – IBM SpectrumScale/GPFS
    • 77 ESS GL4, w/ O(30K) 10TB NL-SAS drives
    • IB EDR connected
  – 250 PB usable, formatted
    • ~90x of 2.8 PB DDR+HBM of Summit
  – 2.5 TB/s aggregate sequential write/read
  – 2.2 TB/s aggregate random write/read
  – 800K/s 32KB file transactions
    • create/open+write+close
  – ~30K 0B file creates in a shared directory (see the sketch below)
• Each GL4
  – 2 P9-based NSD servers
  – 4 106-slot disk enclosures
  – 12 Gbps SAS connected (NSD – enclosure)
  – 422 disks in total, organized in 2 distributed RAID sets
• Each NSD
  – 2 IB ConnectX-5 EDR ports connect to Summit
  – 2 IB ConnectX-5 EDR ports connect to the rest of OLCF
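For context on the metadata figures above, a create-rate number of this kind is usually gathered with a small metadata microbenchmark: many clients create empty files in one shared directory and the aggregate create rate is reported. The sketch below is illustrative only (the directory path, file count, and single-process structure are assumptions, not the benchmark behind the slide's numbers), but it shows the create/open+write+close pattern being counted.

/* Illustrative shared-directory create-rate microbenchmark (assumption:
 * this is NOT OLCF's acceptance benchmark). Each process creates many
 * 0-byte files in one shared directory and reports its create rate. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const char *dir = (argc > 1) ? argv[1] : "/gpfs/alpine/scratch/shared_dir"; /* hypothetical path */
    long nfiles = (argc > 2) ? atol(argv[2]) : 10000;
    char path[4096];
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < nfiles; i++) {
        /* create/open + (no data) + close: one 0-byte file create */
        snprintf(path, sizeof(path), "%s/f.%d.%ld", dir, (int)getpid(), i);
        int fd = open(path, O_CREAT | O_WRONLY | O_EXCL, 0644);
        if (fd < 0) { perror("open"); return 1; }
        close(fd);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%ld creates in %.2f s -> %.0f creates/s\n", nfiles, secs, nfiles / secs);
    return 0;
}

Real measurements run this loop concurrently from many ranks on many nodes against the same directory, which is exactly what stresses the file system's shared-directory locking.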

Summit in-system storage layer

• Each Summit compute node can write @ 12.5 GB/s to Alpine
  – Max out Alpine w/ 200 Summit compute nodes (200 × 12.5 GB/s = 2.5 TB/s)
• Each Summit node has a 1.6 TB Samsung PM1725a NVMe, exclusive
  – 6 GB/s read and 2.1 GB/s write I/O performance
  – 5 drive writes per day (DWPD)
  – Formatted as XFS (node-local file system)
• In aggregate, the Summit in-system storage layer provides
  – 7.4 PB (4,608 nodes × 1.6 TB) @ 26.7 TB/s read and 9.7 TB/s write I/O performance
  – 4.6 billion IOPS in aggregate
  – 2.5 times the capacity of aggregate system DRAM and HBM
• The question is how to aggregate and present this in software as an effective I/O solution

Spectral

• An I/O intercept library interposing between the application and Spider 3 – https://code.ornl.gov/cz7/Spectral

• No application code changes required; supports N-N and N-1 workloads

• All writes are redirected to local XFS and later transferred to a location on Spider 3 specified by an environment variable in the job script (see the sketch after this list)

• Does not add extra metadata on the I/O critical path

• Detects the close() of the file handle and enqueues the data transfer to the data mover process that runs on the isolated system cores

• Maintains a log of files in motion; on application exit, a call to the Spectral wait tool holds the job open until the remaining files are transferred

• Feature complete; code hardening for operations
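As a rough illustration of the interception idea described above (this is not Spectral's source; the redirect policy, helper names, and reliance on PERSIST_DIR are assumptions for the sketch), an LD_PRELOAD-style shim can rewrite the path of newly created output files to the node-local XFS directory on open() and hand finished files to a data mover on close().

/* Illustrative LD_PRELOAD-style write-redirect shim -- NOT Spectral's source.
 * Compile: gcc -shared -fPIC -o shim.so shim.c -ldl
 * PERSIST_DIR names the node-local XFS directory (assumption for this sketch). */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <fcntl.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

static int (*real_open)(const char *, int, ...);
static int (*real_close)(int);

/* Rewrite an output path so it lands under the node-local XFS directory. */
static const char *redirect(const char *path, char *buf, size_t len)
{
    const char *persist = getenv("PERSIST_DIR");
    if (!persist)
        return path;                                   /* shim effectively disabled */
    const char *base = strrchr(path, '/');
    snprintf(buf, len, "%s/%s", persist, base ? base + 1 : path);
    return buf;
}

int open(const char *path, int flags, ...)
{
    char buf[4096];
    if (!real_open)
        real_open = (int (*)(const char *, int, ...))dlsym(RTLD_NEXT, "open");
    if (flags & O_CREAT) {                             /* redirect newly created output files */
        va_list ap;
        va_start(ap, flags);
        mode_t mode = (mode_t)va_arg(ap, int);
        va_end(ap);
        return real_open(redirect(path, buf, sizeof(buf)), flags, mode);
    }
    return real_open(path, flags);
}

int close(int fd)
{
    if (!real_close)
        real_close = (int (*)(int))dlsym(RTLD_NEXT, "close");
    /* A real intercept library would record the finished file here and
     * enqueue it with a data-mover daemon so it is drained to the PFS. */
    return real_close(fd);
}

In practice the real library also tracks which descriptors it redirected, logs files in motion, and runs its data mover on the isolated system cores; the sketch only shows the redirect-on-open / enqueue-on-close structure.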

SymphonyFS

• FUSE-based file system presenting a unified namespace over the distributed node-local XFS filesystems and Spider 3 – https://code.ornl.gov/techint/SymphonyFS

• Provides a single mount point combining node-local and PFS namespaces

• No application code changes required; supports N-N and N-1 workloads

• Metadata and read calls are directed to Spider 3; writes are directed to local XFS – Relies on Spider 3 for most metadata operations, which adds latency as they transit FUSE

• The SymphonyFS daemon then transfers the data from the local XFS namespace to Spider 3 through the compute node's local GPFS client

• Applications must either avoid read-after-write and overlapping writes between nodes (not very common use cases for OLCF) or employ manual intervention with mechanisms such as fsync() (see the sketch after this list)

• In development, improving metadata performance
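A minimal sketch of the manual intervention mentioned in the read-after-write bullet, assuming a hypothetical SymphonyFS mount at /symphonyfs (the path, sizes, and file names are illustrative): the writer fsync()s and closes its output before signaling any reader on another node, since writes are buffered on the node-local NVMe until drained.

/* Writer-side sketch for a write-then-read-elsewhere workflow on a
 * FUSE-backed write-back mount. The mount point /symphonyfs and the
 * file name are assumptions for illustration. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/symphonyfs/run42/checkpoint.dat";  /* hypothetical path */
    static char block[1 << 20];
    memset(block, 0xAB, sizeof(block));

    int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    for (int i = 0; i < 512; i++) {                 /* write 512 MiB of checkpoint data */
        if (write(fd, block, sizeof(block)) != (ssize_t)sizeof(block)) {
            perror("write"); return 1;
        }
    }

    /* Force the buffered data out to the backing store before any other
     * node reads this file; without this, a reader on a different node
     * may see stale or missing data. */
    if (fsync(fd) != 0) { perror("fsync"); return 1; }
    close(fd);

    /* Only now signal the consumer (e.g., create a "done" marker file). */
    return 0;
}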

Spectral and SymphonyFS workflow

[Workflow diagram: on each compute node, applications write through either the Spectral intercept library or the SymphonyFS mount (1); data lands on the node-local NVMe (2); and is then drained to GPFS, which provides the unified namespace (3)]

Spectral basic use case

• Needs four easy changes to the application configuration (see the illustrative snippet below)
  – An update in the configuration file to tell the application to write to a path on the local XFS filesystem on the node-local NVMe SSD
  – Setting PERSIST_DIR, the base directory location on local XFS where the application will write these files
  – Setting a location for Spectral to move the files to on Spider 3, noted as PFS_DIR
  – Notifying the runtime system, using the module and alloc_flags parameters, to load the Spectral libraries
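A hypothetical illustration of the application-side part of these changes (PERSIST_DIR and PFS_DIR are the variables named above; the file name, fallback path, and rank handling are assumptions): the application simply writes its output under PERSIST_DIR and relies on Spectral to drain the finished file to PFS_DIR on Spider 3.

/* Hypothetical application-side change for the Spectral use case:
 * write output under PERSIST_DIR (node-local XFS) instead of the PFS.
 * Spectral later drains finished files to PFS_DIR on Spider 3. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const char *persist = getenv("PERSIST_DIR");     /* node-local XFS base dir */
    char path[4096];

    /* Fall back to the current directory if Spectral is not configured. */
    snprintf(path, sizeof(path), "%s/output.%d.dat",
             persist ? persist : ".", 0 /* rank id would go here */);

    FILE *out = fopen(path, "w");
    if (!out) { perror("fopen"); return 1; }
    fprintf(out, "checkpoint payload...\n");         /* application output */
    fclose(out);                                     /* close() triggers the drain */
    return 0;
}

The remaining changes, loading the Spectral module and setting alloc_flags in the job script, happen outside the application and are not shown here.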

OLCF end-to-end I/O solutions

Table 1: End-to-end I/O solutions for Summit and its ecosystem. Spectral and SymphonyFS reflect application-level performance. The Spectral I/O library performs at the rate of the aggregate local XFS filesystems. SymphonyFS performance numbers are projections based on small-scale test results and reflect N-1 I/O workloads with large block sizes.

Solution    | POSIX     | Usability | Capacity (aggregate) | Performance (aggregate)               | Supported I/O modes | ML/DL support
Spider 3    | Compliant | High      | 250 PB               | 2.5 TB/s for reads and writes         | N-N and N-1         | Low
Local XFS   | Compliant | Low       | 7.4 PB (1.6 TB/node) | 24.8 TB/s for reads, 9.6 TB/s writes  | N-N                 | High
Spectral    | N/A       | Medium    | 7.4 PB               | 24.8 TB/s for reads, 9.6 TB/s writes  | N-N                 | High
SymphonyFS  | Compliant | High      | 7.4 PB               | 2.5 TB/s for reads, 9.6 TB/s writes   | N-N and N-1         | High

Select performance evaluations – Core isolation

• On Summit compute nodes, the GPFS client is limited to using a few cores
  – On NAMD, the core isolation (CI) impact is evident
  – Minimizes jitter and maximizes parallel application performance
  – Summit is tuned for parallel application performance
  – SMT-4
    • 4 hardware threads per core

[Chart: NAMD run time per step (seconds, 0.000-0.020) with_CI vs. without_CI, over sampling steps 0-11 and their average]

Select performance evaluations – NVMe vs. Spider 3

• Single node 4KB random read performance
  – PM1725a NVMe is rated for 1M 4KB IOPS
  – At low queue depths, NVMe can achieve double the performance of Spider 3
  – At high queue depths, NVMe can easily provide >700K 4KB read IOPS
  – (A simple measurement sketch follows the chart.)

[Chart: 4KB random read IOPS (0-900K) at queue depths 1, 2, 4, 8, 16, 32, 64, for NVMe and GPFS with 1, 3, 6, and 9 concurrent jobs]
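The shape of such a comparison can be reproduced with a small random-read loop like the hedged sketch below (single-threaded, effectively queue depth 1; the file path and iteration count are assumptions, and the slide's numbers come from sweeping queue depths with a proper benchmark, not from this code).

/* Illustrative single-threaded 4KB random-read loop (queue depth 1).
 * The file path and counts are assumptions; this is not the benchmark
 * used for the slide's numbers. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const char *path = (argc > 1) ? argv[1] : "/mnt/bb/testfile";  /* hypothetical */
    const size_t bs = 4096;
    const long iters = 100000;
    char *buf;
    if (posix_memalign((void **)&buf, 4096, bs)) return 1;

    int fd = open(path, O_RDONLY | O_DIRECT);       /* O_DIRECT to bypass the page cache */
    if (fd < 0) { perror("open"); return 1; }
    off_t size = lseek(fd, 0, SEEK_END);
    long nblocks = size / (off_t)bs;
    if (nblocks <= 0) { fprintf(stderr, "file too small\n"); return 1; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    srand(42);
    for (long i = 0; i < iters; i++) {
        off_t off = (off_t)(rand() % nblocks) * bs; /* random 4KB-aligned offset */
        if (pread(fd, buf, bs, off) != (ssize_t)bs) { perror("pread"); return 1; }
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%ld reads in %.2f s -> %.0f IOPS\n", iters, secs, iters / secs);
    close(fd);
    free(buf);
    return 0;
}

Higher queue depths require an asynchronous engine (for example many threads, or libaio/io_uring) keeping several such reads in flight at once.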

Select performance evaluations – Spectral vs. Spider 3

• At-scale performance testing, 4,096 Summit nodes
• IOR, 6 ranks/node, write 1 GB/rank, 1 MB I/O size, N-N/FPP
• Tremendous reduction in time spent in I/O at scale
  – Metadata is not passed down to Spider 3

Table 2: Measured Spectral and GPFS performance on 4,096 nodes of Summit.

System    | Bandwidth (MiB/s) | Open (s) | Write (s) | Close (s)
Spectral  | 8,340,459         | 0.001    | 3.02      | 0.203
Spider 3  | 35,787            | 582.03   | 694.92    | 637.62

Select performance evaluations – SymphonyFS vs. Spider 3

• SymphonyFS small-scale write test using IOR
  – vs. local XFS and one GL4
  – Comparable results to local XFS
  – Not including the last checkpoint
• Metadata impact
  – Small I/O hurts FUSE/SymphonyFS performance
  – SymphonyFS masks the latency of transferring the data to Spider 3 while the next phase is computing
    • The last-phase penalty cannot be masked

Conclusions

• Cost-effective I/O subsystem design supporting multiple competing requirements
• In-house developed I/O solutions for the in-system storage layer to support N-N, N-1, and emerging ML workloads
  – Linear performance scaling, high IOPS
• A high-performance, large-capacity, center-wide scratch POSIX namespace supporting all OLCF systems and resources

Acknowledgements

This work was performed under the auspices of the U.S. DOE by the Oak Ridge Leadership Computing Facility at ORNL under contract DE-AC05-00OR22725.

Comparison of Titan, Summit, and Frontier Systems

System Specs            | Titan                                        | Summit                                          | Frontier
Peak                    | 27 PF                                        | 200 PF                                          | ~1.5 EF
# cabinets              | 200                                          | 256                                             | > 100
Node                    | 1 AMD Opteron CPU, 1 NVIDIA K20X Kepler GPU  | 2 IBM POWER9 CPUs, 6 NVIDIA Volta GPUs          | 1 HPC and AI optimized AMD EPYC CPU, 4 purpose-built AMD Radeon Instinct GPUs
On-node interconnect    | PCI Gen2, no coherence across the node       | NVIDIA NVLINK, coherent memory across the node  | AMD Infinity Fabric, coherent memory across the node
System interconnect     | Cray Gemini network, 6.4 GB/s                | Mellanox dual-port EDR IB network, 25 GB/s      | Cray four-port Slingshot network, 100 GB/s
Topology                | 3D Torus                                     | Non-blocking Fat Tree                           | Dragonfly
Storage filesystem      | 32 PB, 1 TB/s, Lustre                        | 250 PB, 2.5 TB/s, IBM Spectrum Scale with GPFS  | 2-4x performance and capacity of Summit's I/O subsystem
Near-node NVM (storage) | No                                           | Yes                                             | Yes
