Enhancing the Performance and Scalability of Third-Party Libraries in the Sandia I/O Software Stack
Enhancing the Performance and Scalability of Third-Party Libraries in the Sandia I/O Software Stack
HDF5 User Group Meeting 2020, October 14-16, 2020
Gregory Sjaardema, Engineering Sciences Center, Sandia National Laboratories, Albuquerque NM

Sandia National Laboratories is a multimission laboratory managed and operated by National Technology & Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International Inc., for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-NA0003525. Unclassified, Unlimited Release. SAND2020-10955 C

Scalability ◦ Capacity ◦ Complexity

Exodus Background
◦ Developed 1992
◦ Replaced old Fortran unformatted read/write files with a device-independent, random-access database
◦ Based on NetCDF
◦ Single format used by all SNL structural and thermal FE codes
◦ Nodes, homogeneous element blocks, node sets, side sets
◦ Transient data on nodes and elements
◦ A suite (SEACAS) of mesh generation, preprocessing, postprocessing, visualization, and translation tools developed to generate and modify Exodus files
◦ Backward compatibility essential
Sjaardema: Sub-JOWOG-34, 2017 SAND2017-1050 C

Exodus Evolution
Capacity:
◦ 1992: CDF1, ~34 million nodes/elements (~NetCDF 2.3.x)
◦ 2004: CDF2, ~500 million nodes/elements (NetCDF 3.6.0); internal changes to the Exodus format to reduce dataset sizes
◦ 2008: NetCDF-4 (HDF5 based), 2.1 billion nodes/elements
◦ 2012: 64-bit integer changes (HDF5 enabled), multi-billion nodes/elements; converts integer size (ids, indices, maps) on the fly
Capability:
◦ Named blocks, sets, attributes, and maps
◦ Transient variables on all blocks and sets
◦ "Unlimited" string size replaced the fixed 32-character limit
◦ Full topology support (Element -> Face -> Edge -> Node)
◦ Arbitrary polyhedral element support
◦ Compression (via HDF5)
◦ File groups (via HDF5)
◦ Assemblies, entity attributes, discontinuous Galerkin

Exodus Evolution -- Parallel
The original workflow was file-per-processor:
◦ External tools to split to parallel and join from parallel
◦ Extension library "nemesis" provided the parallel data structures
◦ Each "processor" file is a valid Exodus file
Exodus became "parallel-aware":
◦ Nemesis embedded in Exodus (the ne_ API prefix changed to ex_)
◦ PnetCDF and HDF5 provide parallel I/O capabilities
◦ Can support auto-decomposition and auto-join

I/O Subsystem -- IOSS Library
◦ Started as the I/O component of the Sierra project, 12/1999
◦ Provides a database-independent interface to applications (Exodus, CGNS, SAF, XDMF, ADIOS2, …)
◦ Also functions as a "pseudo-C++ API" for the Exodus library
◦ Supports advanced HPC capabilities: Kokkos data, burst buffer, data warehouse (FAODEL), embedded visualization
◦ Auto-decomposition:
  ◦ Replaces legacy file-per-processor mode
  ◦ Uses either HDF5 or PnetCDF for parallel input
  ◦ Uses the decomposition methods in Zoltan and ParMETIS
  ◦ Supports Exodus and CGNS (structured and unstructured)
◦ Auto-join (single-file output) option:
  ◦ Uses HDF5 or PnetCDF for parallel output
  ◦ Scalability issues… being addressed
◦ HDF5 is used by three of the supported database types
Parallel Scalability -- Exodus
2 Billion Elements, Sequoia Lustre Filesystem
A large model being run on Sequoia: the application reported "analysis is hanging."
Ran the same mesh in the IOSS simulator with tracing enabled to determine what was happening; initial results looked promising…
[Chart: execution time (seconds, log scale, minutes up to a week) vs. processor count (64 to 262,144); the "Execution - Old" and "Open - Old" curves grow rapidly with processor count.]
"A supercomputer is a device for turning compute-bound problems into I/O-bound problems."
◦ Added option to use the (then-new) HDF5 collective metadata option
◦ Additional performance improvements, parallel fixes, code cleanups
◦ Work closely with the HDF5 and NetCDF developers
◦ Working with HDF5 on flush, truncate, and open scalability issues
[Chart: the same plot with the improved "Execution - New" and "Open - New" curves added.]

Parallel Scalability -- Exodus AutoJoin
134 Million Elements, rzuseq
Improving parallel scalability of auto-join (single-file output):
◦ Patch NetCDF/HDF5 to avoid unneeded data access (PR 336)
[Chart: execution time (seconds, 0 to 300) vs. processor count (32, 128, 512) comparing "Compose" and "FPP".]
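For reference, the HDF5 collective metadata option mentioned above is enabled on the file-access property list before the file is opened or created. A minimal configuration fragment, assuming a parallel (MPI) HDF5 1.10-or-newer build; error checking is omitted:

```cpp
#include <hdf5.h>
#include <mpi.h>

// Configuration fragment (parallel HDF5 >= 1.10): build a file-access
// property list that uses MPI-IO and makes metadata reads and writes
// collective across all ranks.
hid_t make_collective_fapl(MPI_Comm comm) {
  hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
  H5Pset_fapl_mpio(fapl, comm, MPI_INFO_NULL);  // parallel (MPI-IO) access
  H5Pset_all_coll_metadata_ops(fapl, true);     // collective metadata reads
  H5Pset_coll_metadata_write(fapl, true);       // collective metadata writes
  return fapl;                                  // pass to H5Fopen/H5Fcreate
}
```

Collective metadata operations let one rank read the metadata and broadcast it, rather than having every rank hit the filesystem independently, which is where much of the open-time growth came from.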
◦ Code review of the IOSS "autojoin" routines
Preliminary results look promising; more to do:
◦ Darshan profiling
◦ MPI profiling
◦ MPI-IO tuning
◦ Filesystem tuning
◦ Working with HDF5 on flush, truncate, and close scalability issues
[Chart: "Compose Timing" -- execution time (0.1 to 10,000 seconds, log scale) vs. processor count (2 to 32,768) for "Fill - Redefine", "Fill - Elapsed", "NoFill - Redefine", and "NoFill - Elapsed".]

Input/Output Performance Improvements -- CGNS Structured Mesh
Successive improvements: metadata MPI_Bcast(), improved N->1 output, compact storage.
[Chart: execution time (0.001 to ~100,000 seconds, log scale) vs. MPI ranks (32 to 32,768) for Baseline, "Add MetaData BCast", "Improved N->1", "Compact Storage", file-per-processor (FPP), and a serial reference.]

Input/Output Performance Improvements -- Large Model, CGNS Structured Mesh
[Chart: execution time (seconds, log scale) vs. MPI ranks (32 to 16,384) for Baseline, "Improved N->1", "Improved, Striped", file-per-processor (FPP), and a serial reference.]
Improvements:
◦ 2 to 3 orders of magnitude
◦ Hours to seconds…
◦ Competitive with FPP
◦ Additional speedups (stripe, …)

Recipe for Success…
◦ Test on representative models
◦ Use simulator for faster trials
◦ Instrument/Trace/Profile
◦ Collaborate with TPL developers
◦ Test hypotheses; retest
◦ Verify on different models
◦ Verify in actual application
◦ Productionize / stabilize
◦ Repeat…

Exodus Speedups
Exodus-format file in NetCDF-4 (HDF5) format:
◦ 64-bit integers, 64-bit doubles
◦ 48.3 Gbyte file
◦ 390 million finite element nodes
◦ 49 million 27-node hexahedral elements
◦ 1 time step with 8 variables per node
Lustre filesystem ("nscratch", 11 PByte):
◦ Input file stripe count = 1
◦ Output file stripe count = 36
Run on the "serrano" CTS-1 system:
◦ Dual-socket 2.1 GHz Intel Broadwell E5-2695 v4, 18 cores per socket
◦ ~1.2 TFLOPs per node
◦ 128 GB RAM per node (3.55 GB per core)
◦ Intel Omni-Path high-speed interconnect
[Chart: execution time (seconds, log scale) vs. MPI ranks (32 to 32,768) comparing Baseline, Compact, and FPP.]
Significant speedup realized:
◦ NetCDF PR 1570 (HDF5 compact storage); merged, released in NetCDF 4.7.4
◦ Uses an existing API function
◦ Still slower than FPP by >2X; more profiling needed

CGNS Many-Zone Optimization
Creating a CGNS file with "many" zones (metadata only being written; serial or parallel) exposed N^2 behavior:
◦ Profiling showed H5Literate_by_name was a major contributor
◦ H5Literate_by_name changed to H5Lexists; the calling code no longer shows up as a hot spot
◦ Still N^2 behavior in another part of the code due to string compares
◦ Looking at hashing strings to avoid this cost
With both of these fixes, behavior is near-linear out to a million zones.
[Chart: "Execution Time for Zone Create Testing" -- execution time (thousands, up to ~1,200) vs. zone count (0 to 1,000,000) for Baseline, "H5Literate_by_name Change", and "Remove strcmp cost".]
Input by Scot Breitenfeld and Mickael Philit

Conclusions
Scalability, Complexity, and Capacity are very important to HPC applications.