SC18 Omni-Path User’s Group BOF

November 12, 2018

© 2018 Pittsburgh Supercomputing Center

Panel Participants

• Welcome – Richard Underwood, Pittsburgh Supercomputing Center
• OPA Update – Philip Murphy, Intel
• Panelists
  – Jerry Perez – UT Dallas
  – Toshio Endo – Tokyo Tech GSIC (TiTech)
  – Glenn Johnson – University of Iowa

AGENDA

• Welcome/Introduction (5 min, Rich Underwood)
• Intel Fabric Update (10 min, Phil Murphy)
• Panelist Introductions (~5 min each, Panelists)
• Panel Discussion (20 min, all, moderated by Rich Underwood)
• Closing Remarks (5 min, Rich Underwood)

OPUG NEWS

• OPUG Steering Committee formed and meeting
• OPUG Charter created and ratified by the Steering Committee
• Three OPUG face-to-face meetings in 2018 (ISC18, PEARC18, SC18)
• New OPUG website – www.opug.org

Bridges – New User Communities
Richard Underwood, HPC System Administrator
11/12/2018

© 2018 Pittsburgh Supercomputing Center

OPA Operations

• The Good
  – Works out of the box pretty easily
  – Easy to expand the fabric
• The Bad
  – Was harder to diagnose problems
• Future Wants
  – Would love a simple RPM install

Intel® Omni-Path Architecture
SC 2018 Omni-Path User Group Annual Meeting

Philip Murphy
Intel DCG Connectivity Group
Nov. 12, 2018

Notices and Disclaimers

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. Check with your system manufacturer or retailer or learn more at intel.com.

No computer system can be absolutely secure.

Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit http://www.intel.com/benchmarks.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/benchmarks.

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Performance estimates were obtained prior to implementation of recent software patches and firmware updates intended to address exploits referred to as "Spectre" and "Meltdown." Implementation of these updates may make these results inapplicable to your device or system.

Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.

Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate.

© 2018 Intel Corporation. Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as property of others.

Intel® OPA Market Success

TOP500 Momentum – Deployed in all geos, all usages

• Record number of Intel® OPA entries on TOP500 – 43 total

• Powering the 6 fastest academic clusters in the world

• >28% increase in Total FLOPS since June

Source: Top500.org, November 2018

Growing Fabric Builders Ecosystem

• Over 90 Fabric Builder members

Storage Enhancements in IFS 10.8

– Improved verbs performance
– Multi-core completion queue handling
– IPoFabric core changes
  − Separate TX/RX completion handling for improved per-core utilization
  − (RHEL 7.5 only in 10.8; other distros will follow)
– Offload core 0 to improve IA housekeeping
– Interim updates to improve aggregate bandwidth of the IP Router solution
  − Dense router support, multiple HFIs for aggregate bandwidth improvements
  − Addressed inefficiencies when two or more HFI ports share a NUMA node
– Best performance with jumbo MTU (see the sketch below)

More enhancements coming in IFS 10.9
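As a rough illustration of the jumbo-MTU recommendation for IPoFabric traffic: the MTU is raised on the IPoFabric interface of each node. The interface name ib0 and the value 9000 below are assumptions; use the interface name on your system and the largest MTU your IFS release and switches support.

    # show the current IPoFabric interface and MTU (interface name ib0 is an assumption)
    ip link show ib0

    # raise the MTU for the running system; 9000 is a common jumbo value
    ip link set dev ib0 mtu 9000

    # persist on RHEL 7.x by adding to /etc/sysconfig/network-scripts/ifcfg-ib0:
    #   MTU=9000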

Accelerated GPU Performance – GPUDirect RDMA

• GPUDirect RDMA – RDMA across systems
• GPUDirect P2P – DMA on the same system/PCIe
• Up to 68% faster with optimizations¹

PSM2 NUMA Awareness
• Intelligent HFI selection
• If multiple HFIs & GPUs, PSM2 uses the local HFI

New capabilities
• GDRCopy – further improves latency & bandwidth performance
• Support for CUDA 9.1 with Open MPI

[Chart: osu_latency -d cuda D D, latency (µsec) vs. message size (bytes), Baseline vs. Optimized GDRCopy]

Configuration: Intel® Xeon® processor E5-2699 v4, SLES 12.3 (4.4.73-5-default), 0xb00001b microcode, Intel Turbo Boost Technology enabled. Dual-socket servers connected back to back with no switch hop. NVIDIA* Tesla* P100 and Intel® OPA HFI both connected to the second CPU socket. 64 GB DDR4-2133 memory per node. OSU Microbenchmarks version 5.3.2, Open MPI 2.1.2-cuda-hfi as packaged with IFS 10.7. ¹68% higher claim based on 4-byte latency.
Optimized performance: mpirun -np 2 --map-by ppr:1:node -host host1,host2 -x PSM2_CUDA=1 -x PSM2_GPUDIRECT=1 -x HFI_UNIT=1 ./osu_latency -d cuda D D
Baseline performance: same as above but with -x PSM2_GDRCPY=0 (off)

Higher Bandwidth Intel® OPA GPU Buffer Transfers – NVIDIA* Tesla* P100

• Uni-directional bandwidth (osu_bw -d cuda D D): up to 30% higher with optimizations¹
• Bi-directional bandwidth (osu_bibw -d cuda D D): up to 73% higher with optimizations²

[Charts: bandwidth (MB/s) vs. message size (bytes), Baseline IFS 10.7 vs. Optimized]

Configuration: Intel® Xeon® processor E5-2699 v4, SLES 12.3 (4.4.73-5-default), 0xb00001b microcode, Intel Turbo Boost Technology enabled. Dual-socket servers connected back to back with no switch hop. NVIDIA* P100 and Intel® OPA HFI both connected to the second CPU socket. 64 GB DDR4-2133 memory per node. OSU Microbenchmarks version 5.3.2, Open MPI 2.1.2-cuda-hfi as packaged with IFS 10.7. ¹30% higher claim based on 8 KB uni-directional bandwidth. ²73% higher claim based on 64 B bi-directional bandwidth.
Optimized performance: mpirun -np 2 --map-by ppr:1:node -host host1,host2 -x PSM2_CUDA=1 -x PSM2_GPUDIRECT=1 -x HFI_UNIT=1 ./osu_latency -d cuda D D
Baseline performance: same as above but with -x PSM2_GDRCPY=0 (off)

HPC Cloud – Multi-Tenant Dynamic vFabrics

Virtual Fabrics (vFabrics) provide fabric security & QoS

• Security feature – traffic isolation for multiple tenants (e.g. Partition A / Tenant A, Partition B / Tenant B)
• Key capabilities – dynamically create, change, and remove vFabrics
  – Without FM restart
  – Without disruption to other vFabrics
• Supports up to 1,000 vFabrics

[Diagram: tenant partitions of nodes/HFIs managed by the Fabric Manager, plus additional links and switches]

More HPC Cloud enhancements coming in IFS 10.9

Next Generation Fabric from Intel

• 200 Gbps for higher performance, faster time-to-solution
• Increased switch radix to 64 for continued power and space savings, TCO leadership, and fabric scale
• Enhanced performance features
  − High message rate throughput
  − Low latency that stays low at scale
  − Increased fabric scale
  − Improved QoS for application performance
• Interoperable and compatible with the current generation
• Ready for exascale, more complex workloads, and deeper learning
• Coming in 2019

Omni-Path Architecture Solutions for Heterogeneous Networks in Compute Environments
Dr. Jerry F. Perez, University of Texas at Dallas

Texas Tech OPA Cluster – Quanah

• Named after Quanah Parker, a Texas Comanche Indian chief
• Built in 2016–2017 in two phases
• Uses a 100 Gb/s non-blocking Omni-Path fabric
• The cluster is connected to an InfiniBand-based Lustre file storage system
• LNET routers made the connection between the OPA network and the IB network possible; the LNET router is what makes heterogeneous OPA/IB clusters possible

LNET Routing Explained

• Storage servers are on LAN1, a Mellanox-based InfiniBand network – 10.10.0.0/24
• Clients are on LAN2, an Intel OPA network – 10.20.0.0/24
• The router sits between LAN1 and LAN2, at 10.10.0.1 and 10.20.0.1
• For the purpose of setting up the lustre.conf files in this example, the Lustre network on the Intel Omni-Path fabric is named o2ib0, and the Lustre network on the InfiniBand* fabric is named o2ib2 (see the sketch below)
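A minimal sketch of the LNet configuration this addressing implies, using the o2ib0/o2ib2 names from the slide. The IPoFabric/IPoIB interface names (ib0, ib1) are assumptions, and whether a site uses /etc/modprobe.d/lustre.conf options or lnetctl will vary:

    # On the LNET router (one port on each fabric; interface names are assumptions)
    options lnet networks="o2ib0(ib0),o2ib2(ib1)" forwarding="enabled"

    # On the OPA clients (LAN2): reach the IB Lustre network o2ib2 via the router's OPA address
    options lnet networks="o2ib0(ib0)" routes="o2ib2 10.20.0.1@o2ib0"

    # On the InfiniBand storage servers (LAN1): reach o2ib0 via the router's IB address
    options lnet networks="o2ib2(ib0)" routes="o2ib0 10.10.0.1@o2ib2"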

LNET Routers Scale to Translate OPA to IB

• When only a few LNET routers are used, they become a bottleneck and affect performance. When LNET routers are matched 1:1 with the OPA/IB networks, performance increases.

Quanah at Texas Tech

OPA at UTD

• Several departmental servers use OPA.
• New research groups are adopting OPA as a standard fabric for their HPC clusters.
• LNET routing is a possibility for linking OPA HPC clusters to central IB storage.

Thank you!

Tokyo Tech TSUBAME3.0 Supercomputer with Omni-Path Interconnect

Toshio Endo, GSIC, Tokyo Institute of Technology

Overview of TSUBAME3.0

• BYTES-centric architecture: scalability to all 540 nodes and the entire memory hierarchy
• In operation since Aug. 2017; integrated by Silicon Graphics → Hewlett Packard
• Full-bisection-bandwidth Intel Omni-Path interconnect, 4 ports/node
  – Full bisection: 432 Terabits/s bidirectional, roughly 2x the bandwidth of entire Internet backbone traffic
• DDN storage (Lustre FS 15.9 PB + Home 45 TB)

• 540 compute nodes: SGI ICE XA + new blade
  – 2x Intel Xeon CPUs + 4x NVIDIA Pascal GPUs (NVLink)
  – 256 GB memory, 2 TB Intel NVMe SSD
• 47.2 AI-Petaflops, 12.1 Petaflops

Research opened up by TSUBAME3.0

• Mutual and semi-automated co-acceleration of HPC and BD/ML/AI
• Accelerating conventional HPC apps
  – Flame simulations for energy propulsion
  – Environment-friendly urban planning
  – Real-time tsunami simulation
  – Air-water violent flow simulation
  – Ultrafast genome analysis, etc.
• Big Data / AI-oriented methodologies: large-scale graphs, image and video, robots / drones
• Optimizing system software and ops
• Future Big Data / AI supercomputer design

TSUBAME3.0 Compute Node: HPE/SGI ICE-XA – Ultra High Performance & Bandwidth

• "Fat node": 2x Intel Broadwell Xeon + 4x SXM2 (NVLink) NVIDIA Pascal P100 GPUs
• Intel Omni-Path: 100 Gbps x 4 = 400 Gbps per node (100 Gbps per GPU; see the sketch below)
• High I/O bandwidth: Intel 2 TB NVMe SSD

[Block diagram: two CPU sockets linked by QPI, each with two P100 GPUs (NVLink), PCIe switches (PLX), and two 100 Gbps OPA HFIs; NVMe SSD behind the PCH]
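With four HFIs per node, each MPI rank ideally uses the HFI nearest its GPU. A minimal sketch of one way to do that, combining the HFI_UNIT variable shown in the earlier Intel slides with Open MPI's local-rank variable; the one-unit-per-local-rank mapping is an assumption and must be checked against the real PCIe/NUMA topology (e.g. with opainfo), as is the 0-based unit numbering:

    #!/bin/bash
    # select_hfi.sh -- illustrative wrapper: bind each local MPI rank to one HFI and one GPU
    # (assumes 4 ranks per node and that HFI unit i is the one nearest GPU i)
    LOCAL_RANK=${OMPI_COMM_WORLD_LOCAL_RANK:-0}
    export HFI_UNIT=${LOCAL_RANK}              # PSM2: restrict this rank to one HFI
    export CUDA_VISIBLE_DEVICES=${LOCAL_RANK}  # one GPU per rank
    exec "$@"

    # hypothetical usage: mpirun -np 8 --map-by ppr:4:node ./select_hfi.sh ./app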

TSUBAME3.0 Fat-Tree Omni-Path Interconnect (540 nodes)

• Compute nodes: 100G x 4 ports x 540 nodes
• Edge switches: 120 x 48-port (36 ports used); 9 nodes = 1 group, 60 groups; direct connection (no cables)
• Core switches: 3 x 768-port (>720 ports used)
• Connections to storage omitted

Resource Partitioning in TSUBAME3.0

Divide a compute node into partitions in a hierarchical manner (each partition bundles CPU cores with its local GPUs and OPA HFIs):

• F: full node
• H: 1/2 node (14 CPU cores)
• Q: 1/4 node (7 CPU cores)
• G: 1 GPU + 2 CPU cores
• C4: 4 CPU cores
• C1: 1 CPU core

For this purpose, we are using cgroups (see the sketch below).
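A rough illustration of the kind of cgroup setup this implies, using the cgroup v1 cpuset and devices controllers as mounted on RHEL 7. The paths, core ranges, and GPU device numbers are assumptions for illustration only; on TSUBAME3.0 this is driven by the scheduler rather than set up by hand:

    # create a "Q"-style partition: 7 cores on socket 0 and its local memory node
    mkdir -p /sys/fs/cgroup/cpuset/q_part0
    echo 0-6 > /sys/fs/cgroup/cpuset/q_part0/cpuset.cpus
    echo 0   > /sys/fs/cgroup/cpuset/q_part0/cpuset.mems

    # restrict the partition to one GPU via the devices controller
    # (in practice standard devices like /dev/null would also be re-allowed)
    mkdir -p /sys/fs/cgroup/devices/q_part0
    echo a > /sys/fs/cgroup/devices/q_part0/devices.deny            # deny everything first
    echo "c 195:0 rwm"   > /sys/fs/cgroup/devices/q_part0/devices.allow  # /dev/nvidia0 (major 195 assumed)
    echo "c 195:255 rwm" > /sys/fs/cgroup/devices/q_part0/devices.allow  # /dev/nvidiactl

    # place the job's shell (and its children) into the partition
    echo $$ > /sys/fs/cgroup/cpuset/q_part0/tasks
    echo $$ > /sys/fs/cgroup/devices/q_part0/tasks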

User Experiences with T3's OPA

The Good
• More stable hardware environment
• Less performance variability (BW/latency) for large-scale jobs
• Most issues reported at OPUG 2017 have been solved, thanks!!!

Current Issues
• A minor issue with GPUDirect (there is a workaround); this occurs with OPA 10.7, and we will install 10.8 next March

Requests for the Future
• QoS or bandwidth allocation per cgroup/container
• SR-IOV wanted!

OPUG Panel SC18

Glenn Johnson
Systems Architect

November 12, 2018

Argon HPC Cluster

• Condo-style cluster
• CentOS 7, Son of Grid Engine
• 343 Lenovo 28-core Broadwell OPA-connected nodes
• 23 Supermicro 20-core Skylake Silver GPU nodes (no OPA)
• 124 NVIDIA GPUs (various types)
• Hyperthreading enabled
• 20,128 job slots

Argon HPC Purpose

• Research
  – ~1,000 user accounts
  – 90 departments
  – >50 groups have purchased nodes
• Teaching
  – Queues set up for HPC training
  – Queues reserved for ongoing courses
  – Queues set up ad hoc for other teaching/training

OPA Topology

• Leaf/spine
• Top-of-rack 48-port switches
• 9 racks, 40 nodes per rack
• 5:1 oversubscription between racks
• MPI nodes are in the same rack; rarely have jobs that span racks
• Fabric manager on two non-compute nodes

Experience

• Good
  – Fairly straightforward to set up
  – Good tools for checking status
  – Fairly good tools for validation (see the sketch below)
• Bad
  – OPA stack trails OS kernel versions; kernel updates often have to wait for the OPA stack to catch up
  – Limited contexts can be a problem on shared nodes
  – Fabric manager on the switch is limited
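For reference, a few of the status and validation commands typically used here come from the IFS FastFabric tools; a minimal sketch, assuming those tools are installed (exact options vary by IFS release):

    # local HFI port state, LID, and link speed/width
    opainfo

    # one-screen summary of the fabric: SMs, node/port counts, link errors
    opafabricinfo

    # fabric-wide error counters and links running below expected speed
    opareport -o errors
    opareport -o slowlinks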

Thanks

Glenn Johnson

Iowa City, Iowa, U.S.A.

319-384-1209

https://hpc.uiowa.edu/

[email protected]

Photos
