AN OVERVIEW OF AURORA, ARGONNE’S UPCOMING EXASCALE SYSTEM

ALCF DEVELOPERS SESSION

COLLEEN BERTONI, SUDHEER CHUNDURI

AURORA: An Intel-Cray System

An Intel/Cray machine arriving at Argonne in 2021, with sustained performance greater than 1 exaflops.

AURORA: A High-level View

§ Hardware Architecture:
  § Intel processors and GPUs
  § Greater than 10 PB of total memory
  § Cray Slingshot network and Shasta platform
  § IO
    • Uses Lustre and the Distributed Asynchronous Object Store (DAOS)
    • Greater than 230 PB of storage capacity and 25 TB/s of bandwidth

§ Software (Intel oneAPI umbrella):
  § Intel compilers (C, C++, Fortran)
  § Programming models: DPC++, OpenMP, OpenCL
  § Libraries: oneMKL, oneDNN, oneDAL
  § Tools: VTune, Advisor
  § Python

NODE-LEVEL HARDWARE

The Evolution of Intel GPUs

[Figure: roadmap of Intel GPU generations. Source: Intel]

Intel GPUs

§ Intel integrated GPUs have been used for over a decade in:
  § Laptops (e.g. MacBook Pro)
  § Desktops
  § Servers
§ Recent and upcoming integrated generations:
  § Gen 9 – used in Skylake-based nodes
  § Gen 11 – used in Ice Lake-based nodes
§ Gen 9 double precision peak performance: 100-300 GF
  § Low by design due to power and space limits
§ The future Intel Xe (Gen 12) GPU series will provide both integrated and discrete GPUs

[Figure: layout of architecture components for a desktop i7-6700K processor (91 W TDP, 122 mm² die)]

Intel GPU Building Blocks

[Diagram: Gen 9 GPU building blocks]
§ EU (Execution Unit): two SIMD FPUs, branch and send units, thread arbiter, instruction fetch/dispatch and instruction cache
§ Subslice: 8 EUs, plus sampler, dataport, L1 and L2 caches
§ Slice: 24 EUs (3 subslices), L3 data cache (512 KB/slice), shared local memory (64 KB/subslice)

Gen 9 Iris (GT4) GPU Characteristics

Characteristics                 Value               Notes
Clock                           1.15 GHz
Slices                          3
EUs                             72                  3 slices * 3 sub-slices * 8 EUs
Hardware Threads                504                 72 EUs * 7 threads
Concurrent Kernel Instances     16,128              504 threads * SIMD-32
L3 Data Cache Size              1.5 MB              3 slices * 0.5 MB/slice
Max Shared Local Memory         576 KB              3 slices * 3 sub-slices * 64 KB/sub-slice
Last Level Cache Size           8 MB
eDRAM Size                      128 MB
32b float FLOPS                 1152 FLOPS/cycle    72 EUs * 2 FPUs * SIMD-4 * (MUL + ADD)
64b float FLOPS                 288 FLOPS/cycle     72 EUs * 1 FPU * SIMD-2 * (MUL + ADD); 331.2 DP GFlops
32b integer IOPS                576 IOPS/cycle      72 EUs * 2 FPUs * SIMD-4
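As a quick cross-check of the 331.2 DP GFlops figure above (the single precision number is derived here, not stated on the slide), the per-cycle rates combine with the 1.15 GHz clock as:

288 FLOP/cycle × 1.15 GHz = 331.2 DP GFLOP/s
1152 FLOP/cycle × 1.15 GHz = 1324.8 SP GFLOP/s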

Aurora Compute Node

§ 2 Intel Xeon processors
§ 6 Xe architecture based GPUs (Ponte Vecchio)
§ 8 Slingshot fabric endpoints
§ Unified memory architecture across CPU & GPU
§ All-to-all connectivity within the node
§ Low latency and high bandwidth

[Diagram: node layout with two CPUs (each with DRAM), six Xe GPUs, and Slingshot endpoints]

NETWORK INTERCONNECT

Cray Slingshot Network

§ Dragonfly topology:
  § 3-hop network topology
  § Groups connected globally all-to-all with optical cables; flattened butterfly within a group
§ High bandwidth switch:
  § 64 ports at 200 Gb/s in each direction
  § Total bandwidth of 25.6 Tb/s per switch

[Diagram: groups of switches with all-to-all global links; nodes attached to the switches within each group]

Rosetta Switch HPC Features

§ QoS – traffic classes
  § Class: collection of buffers, queues and bandwidth
  § Intended to provide isolation between applications via traffic shaping
§ Aggressive adaptive routing
  § Expected to be more effective for the 3-hop dragonfly due to closer congestion information
§ Multi-level congestion management
  § To minimize the impact of congested applications on others
§ Very low average and tail latency
§ High performance multicast and reductions

[Diagram: Rosetta switch, 64 ports x 200 Gbps per direction]

Slingshot Quality of Service Classes

§ Highly tunable QoS classes
  § Priority, ordering, routing biases, minimum bandwidth guarantees, maximum bandwidth constraints, etc.
§ Supports multiple, overlaid virtual networks
  § High priority compute
  § Standard compute
  § Low-latency control & synchronization
  § Bulk I/O
  § Scavenger class background
§ Jobs can use multiple traffic classes
§ Provides performance isolation for different types of traffic
  § Smaller-size reductions can avoid getting stuck behind large messages
  § Less interference between compute and I/O

[Diagram: bandwidth sharing among three traffic types without QoS vs. with QoS classes]

Congestion Management

§ Hardware automatically tracks all outstanding packets
  § Knows what is flowing between every pair of endpoints
§ Quickly identifies and controls causes of congestion
  § Pushes back on sources (injection points) just the right amount
  § Frees up buffer space for other traffic
  § Other traffic is not affected and can pass stalled traffic
  § Avoids head-of-line (HOL) blocking across the entire fabric
§ Fast and stable across a wide variety of traffic patterns
  § Suitable for dynamic HPC traffic
§ Performance isolation between applications in the same QoS class
  § Applications are much less impacted by other traffic on the network – predictable runtimes
  § Lower mean and tail latency – a big benefit for apps with global synchronization

GPCNeT (Global Performance and Congestion Network Tests)

§ A new benchmark developed by Cray in collaboration with Argonne and NERSC
§ Publicly available at https://github.com/netbench/GPCNET

§ Goals:
  § Proxy for real-world application communication patterns
  § Measure network performance under load
  § Look at both mean and tail latency
  § Look at interference between workloads
  § How well does the network's congestion management perform?

Congestion Impact in Real Systems

[Charts: measured congestion impact on several production systems]

§ InfiniBand does better than Aries
§ Impact worsens with scale and taper
§ Slingshot does really well

MPI on Aurora

§ Intel MPI & Cray MPI – MPI 3.0 standard compliant
§ The MPI library will be thread safe and allow applications to use MPI from individual threads
  § Efficient MPI_THREAD_MULTIPLE (locking optimizations)
§ Asynchronous progress in all types of nonblocking communication
  § Nonblocking send-receive
  § Nonblocking collectives
  § One-sided
§ Hardware optimized collective implementations
§ Topology optimized collective implementations
§ Supports the MPI tools interface
  § Control variables
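A minimal sketch (not from the deck; do_local_work() is a placeholder) of requesting MPI_THREAD_MULTIPLE and overlapping a nonblocking collective with computation:

#include <mpi.h>
#include <cstdio>
#include <vector>

// Placeholder for application work that can overlap communication.
static void do_local_work() { /* ... */ }

int main(int argc, char **argv) {
    // Ask for full multi-threaded MPI support; check what was granted.
    int provided = 0;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
        std::printf("MPI_THREAD_MULTIPLE not available (got %d)\n", provided);

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Nonblocking collective: start the allreduce, compute, then wait.
    std::vector<double> local(1024, rank), global(1024, 0.0);
    MPI_Request req;
    MPI_Iallreduce(local.data(), global.data(), 1024, MPI_DOUBLE,
                   MPI_SUM, MPI_COMM_WORLD, &req);

    do_local_work();   // overlaps with asynchronous progress

    MPI_Wait(&req, MPI_STATUS_IGNORE);
    MPI_Finalize();
    return 0;
}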

IO

Aurora IO Subsystem Overview

§ Two-tiered approach to storage
  § DAOS provides high-performance storage for compute datasets and checkpoints
  § Lustre provides a traditional parallel file system
§ Distributed Asynchronous Object Store (DAOS)
  § Provides the main storage system for Aurora
    • Greater than 230 PB of storage capacity
    • Greater than 25 TB/s of bandwidth
  § New storage paradigm that provides significant performance improvements
    • Enables new data-centric workflows
    • Provides a POSIX wrapper with relaxed POSIX semantics
§ Lustre
  § Traditional PFS
  § Provides the global namespace for the center
  § Compatibility with legacy systems
  § Storage for source code, binaries, and libraries
  § Storage of "cold" project data

[Diagram: compute nodes and login/service nodes connect over Slingshot to DAOS NVM storage nodes and, via gateway nodes and InfiniBand, to the Lustre and GPFS HDD tiers (pre-installed storage for libs, binaries, and the namespace); datasets move between DAOS and Lustre/GPFS]

I/O – User Perspective

§ Users see a single namespace, which is in Lustre
  § "Links" point to DAOS containers within the /project directory
  § DAOS-aware software interprets these links and accesses the DAOS containers
§ Data resides in a single place (Lustre or DAOS)
  § Explicit data movement, no auto-migration
§ Users keep
  § Source and binaries in Lustre
  § Bulk data in DAOS

[Diagram: a single namespace rooted in Lustre (/mnt with lustre and daos subtrees; regular PFS directories and files such as users, projects, and libs with mkl.so and hdf5.so); link files carry extended attributes that point into DAOS pools and containers, e.g. EA: daos/hdf5://PUUID1/CUUID1 (HDF5 container), EA: daos/POSIX://PUUID2/CUUID2 (POSIX container), EA: daos/MPI-IO://PUUID2/CUUID3 (MPI-IO container)]

SOFTWARE

Pre-exascale and Exascale US Landscape

System        Delivery    CPU + Accelerator Vendor
Summit        2018        IBM + NVIDIA
Sierra        2018        IBM + NVIDIA
Perlmutter    2020        AMD + NVIDIA
Aurora        2021        Intel + Intel
Frontier      2021        AMD + AMD
El Capitan    2022        AMD + X

§ Heterogeneous computing (CPU + accelerator)
§ Varying vendors

Three Pillars

Simulation                  Data                        Learning
HPC Languages               Productivity Languages      Productivity Languages
Directives                  Big Data Stack              DL Frameworks
Parallel Runtimes           Statistical Libraries       Statistical Libraries
Solver Libraries            Databases                   Linear Algebra Libraries

Common foundation: Compilers, Performance Tools, Debuggers; Math Libraries, C++ Standard Library, libc; I/O, Messaging; Containers, Visualization; Scheduler; Linux Kernel, POSIX

PROGRAMMING MODELS

Heterogeneous System Programming Models

§ Applications will be using a variety of programming models for exascale:
  § CUDA
  § OpenCL
  § HIP
  § OpenACC
  § OpenMP
  § DPC++/SYCL
  § Kokkos
  § Raja
§ Not all systems will support all models
§ Libraries may help you abstract some programming models

Aurora Programming Models

§ Programming models available on Aurora:
  § OpenCL
  § OpenMP
  § DPC++/SYCL
  § Kokkos
  § Raja

Mapping of Existing Programming Models to Aurora

Existing model          →   Aurora model
OpenMP w/o target       →   OpenMP
OpenMP with target      →   OpenMP
OpenACC                 →   OpenMP
OpenCL                  →   OpenCL
MPI + CUDA              →   DPC++/SYCL
Kokkos                  →   Kokkos
Raja                    →   Raja

Vendor supported programming models: OpenMP, OpenCL, DPC++/SYCL
ECP provided programming models: Kokkos, Raja

OpenMP 5

§ OpenMP 5 constructs will provide a directive-based programming model for Intel GPUs
§ Available for C, C++, and Fortran
§ A portable model expected to be supported on a variety of platforms (Aurora, Frontier, Perlmutter, …)
§ Optimized for Aurora
§ For Aurora, OpenACC codes could be converted into OpenMP (see the sketch below)
  § ALCF staff will assist with conversion, training, and best practices
  § Automated translation is possible through the clacc conversion tool (for C/C++)

https://www.openmp.org/
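As a rough illustration of the OpenACC-to-OpenMP conversion mentioned above (arrays a, b and size n are made up for this sketch; it is not taken from the deck), a simple parallel loop might translate as:

// OpenACC version
#pragma acc parallel loop copyin(a[0:n]) copyout(b[0:n])
for (int i = 0; i < n; ++i)
    b[i] = 2.0f * a[i];

// Equivalent OpenMP 5 offload version
#pragma omp target teams distribute parallel for map(to: a[0:n]) map(from: b[0:n])
for (int i = 0; i < n; ++i)
    b[i] = 2.0f * a[i];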

OpenMP 5 Example: Host/Device Offload

extern void init(float*, float*, int);
extern void output(float*, int);

void vec_mult(float *p, float *v1, float *v2, int N)
{
    int i;
    init(v1, v2, N);
    // target: transfers data and execution control to the device
    // teams: creates teams of threads on the device
    // distribute parallel for simd: distributes iterations to the threads,
    // where each thread uses SIMD parallelism
    #pragma omp target teams distribute parallel for simd \
                map(to: v1[0:N], v2[0:N]) map(from: p[0:N])
    for (i = 0; i < N; i++)
        p[i] = v1[i] * v2[i];
    output(p, N);
}

OpenMP 4.5/5 for Aurora

§ The OpenMP 4.5/5 specification has significant updates to allow for improved support of accelerator devices

Offloading code to run on accelerator:
  #pragma omp target [clause[[,] clause],…]
      structured-block
  #pragma omp declare target
      declarations-definition-seq
  #pragma omp declare variant*(variant-func-id) clause new-line
      function definition or declaration

Distributing iterations of the loop to threads:
  #pragma omp teams [clause[[,] clause],…]
      structured-block
  #pragma omp distribute [clause[[,] clause],…]
      for-loops
  #pragma omp loop* [clause[[,] clause],…]
      for-loops

Controlling data transfer between devices:
  map([map-type:] list)
      map-type := alloc | tofrom | from | to
  #pragma omp target data [clause[[,] clause],…]
      structured-block
  #pragma omp target update [clause[[,] clause],…]

Runtime support routines:
  • void omp_set_default_device(int dev_num)
  • int omp_get_default_device(void)
  • int omp_get_num_devices(void)
  • int omp_get_num_teams(void)

Environment variables:
  • Control default device through OMP_DEFAULT_DEVICE
  • Control offload with OMP_TARGET_OFFLOAD

* denotes OpenMP 5
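A small sketch (illustrative only; the array name and sizes are made up) of how the data-transfer constructs above combine, keeping an array resident on the device across two kernels and refreshing the host copy in between:

void update_example(float *a, int n)
{
    // Keep 'a' on the device for the whole region.
    #pragma omp target data map(tofrom: a[0:n])
    {
        #pragma omp target teams distribute parallel for
        for (int i = 0; i < n; ++i)
            a[i] *= 2.0f;

        // Copy the current device values back to the host mid-region.
        #pragma omp target update from(a[0:n])

        #pragma omp target teams distribute parallel for
        for (int i = 0; i < n; ++i)
            a[i] += 1.0f;
    }   // 'a' is copied back to the host at the end of the target data region
}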

DPC++ (Data Parallel C++) and SYCL

§ SYCL
  • Khronos standard specification
  • SYCL is a C++ based abstraction layer (standard C++11 or later)
  • Builds on OpenCL concepts (but single-source)
  • SYCL is designed to be as close to standard C++ as possible
§ Current implementations of SYCL:
  • ComputeCpp™ (www.codeplay.com)
  • Intel SYCL (github.com/intel/llvm)
  • triSYCL (github.com/triSYCL/triSYCL)
  • hipSYCL (github.com/illuhad/hipSYCL)
  • Runs on today's CPUs and NVIDIA, AMD, and Intel GPUs
§ DPC++
  § Part of the Intel oneAPI specification
  § Intel extension of SYCL to support new innovative features
  § Incorporates the SYCL 1.2.1 specification and Unified Shared Memory
  § Adds language or runtime extensions as needed to meet user needs

DPC++ (Data Parallel C++) and SYCL

Intel DPC++ extensions (see https://spec.oneapi.com/oneAPI/Elements/dpcpp/dpcpp_root.html#extensions-table):

Extension                     Description
Unified Shared Memory (USM)   Defines pointer-based memory accesses and management interfaces
In-order queues               Defines simple in-order semantics for queues, to simplify common coding patterns
Reduction                     Provides a reduction abstraction for the ND-range form of parallel_for
Optional lambda name          Removes the requirement to manually name lambdas that define kernels
Subgroups                     Defines a grouping of work-items within a work-group
Data flow pipes               Enables efficient First-In, First-Out (FIFO) communication (FPGA-only)

SYCL Example: Host/Device Offload

void foo(int *A) {
    // Get a device: selectors determine which device to dispatch to.
    default_selector selector;
    {
        // Create a queue to submit work to, based on the selector.
        queue myQueue(selector);

        // Wrap the host pointer in a sycl::buffer.
        buffer<int, 1> bufferA(A, range<1>(1024));

        myQueue.submit([&](handler& cgh) {
            // Create an accessor for the SYCL buffer.
            auto writeResult = bufferA.get_access<access::mode::write>(cgh);

            // Kernel
            cgh.parallel_for(range<1>{1024}, [=](id<1> idx) {
                writeResult[idx] = idx[0];
            }); // End of the kernel function
        }); // End of the queue commands
    } // End of scope; wait for the queued work to complete.
    /* ... */
}
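The Unified Shared Memory extension listed above allows a pointer-based style instead of buffers and accessors. A minimal sketch, assuming DPC++/SYCL 2020-style USM APIs (malloc_shared, the queue parallel_for shortcut); illustrative only, not taken from the deck:

#include <CL/sycl.hpp>
using namespace sycl;

int main() {
    queue q{default_selector{}};

    // Allocate memory accessible from both host and device.
    constexpr size_t N = 1024;
    int *data = malloc_shared<int>(N, q);

    // Launch a kernel that writes directly through the shared pointer.
    q.parallel_for(range<1>{N}, [=](id<1> idx) {
        data[idx] = static_cast<int>(idx[0]);
    }).wait();

    // The host can read the results without explicit copies.
    int last = data[N - 1];

    free(data, q);
    return last == static_cast<int>(N - 1) ? 0 : 1;
}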

OpenCL

§ Open standard for heterogeneous device programming (CPU, GPU, FPGA)
  § Utilized for GPU programming
§ Standardized by the multi-vendor Khronos Group; V1.0 released in 2009
  § AMD, Intel, NVIDIA, …
§ Many implementations from different vendors
  § The Intel implementation for GPUs is open source (https://github.com/intel/compute-runtime)
§ SIMT programming model
  § Distributes work by abstracting loops and assigning work to threads
  § Does not use pragmas / directives for specifying parallelism
  § Similar model to CUDA
§ Consists of a C-compliant library with a kernel language
  § Kernel language extends C
  § Has extensions that may be vendor specific
§ Programming aspects:
  § Requires host and device code to be in different files
  § Typically uses JIT compilation
  § Example: https://github.com/alcf-perfengr/alcl
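To illustrate the host/kernel split and JIT compilation described above, a minimal sketch using standard OpenCL C APIs (error handling omitted; the first platform and default device are taken blindly; not from the deck):

#include <CL/cl.h>
#include <cstdio>
#include <vector>

// Device code: kept as a source string and JIT-compiled at runtime.
static const char *kSrc =
    "__kernel void scale(__global float *x, float a) {\n"
    "    size_t i = get_global_id(0);\n"
    "    x[i] = a * x[i];\n"
    "}\n";

int main() {
    const size_t n = 1024;
    std::vector<float> host(n, 1.0f);

    // Pick the first platform and its default device.
    cl_platform_id platform; cl_device_id device;
    clGetPlatformIDs(1, &platform, nullptr);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, nullptr);

    cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, nullptr);
    cl_command_queue queue = clCreateCommandQueueWithProperties(ctx, device, nullptr, nullptr);

    // JIT-compile the kernel source for the selected device.
    cl_program prog = clCreateProgramWithSource(ctx, 1, &kSrc, nullptr, nullptr);
    clBuildProgram(prog, 1, &device, nullptr, nullptr, nullptr);
    cl_kernel kernel = clCreateKernel(prog, "scale", nullptr);

    // Copy data to the device, set arguments, and launch over n work-items.
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                n * sizeof(float), host.data(), nullptr);
    float a = 2.0f;
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
    clSetKernelArg(kernel, 1, sizeof(float), &a);
    clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &n, nullptr, 0, nullptr, nullptr);

    // Read the result back and clean up.
    clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, n * sizeof(float), host.data(), 0, nullptr, nullptr);
    std::printf("host[0] = %f\n", host[0]);

    clReleaseMemObject(buf); clReleaseKernel(kernel); clReleaseProgram(prog);
    clReleaseCommandQueue(queue); clReleaseContext(ctx);
    return 0;
}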

oneAPI

§ Industry specification from Intel (https://www.oneapi.com/spec/)
  § Language and libraries to target programming across diverse architectures (DPC++, APIs, low-level interface)

§ Intel oneAPI products and toolkits (https://software.intel.com/ONEAPI)
  § Implementations of the oneAPI specification, plus analysis and debug tools to help with programming

TOOLS AND LIBRARIES

VTune and Advisor

§ VTune Profiler
  § Widely used performance analysis tool
  § Currently supports analysis on Intel integrated GPUs
  § Will support future Intel GPUs
§ Advisor
  § Provides roofline analysis
  § Offload analysis will identify components that are profitable to offload
    • Measures performance and behavior of the original code
    • Models specific accelerator performance to determine offload opportunities
    • Considers overhead from data transfer and kernel launch

Intel MKL

§ Highly tuned algorithms
§ Optimized for every Intel compute platform
§ oneAPI MKL (oneMKL)

AI and Analytics

§ Libraries to support AI and analytics
  § oneAPI Deep Neural Network Library (oneDNN)
    • High-performance primitives to accelerate deep learning frameworks
    • Powers TensorFlow, PyTorch, MXNet, Intel Caffe, and more
    • Running on Gen 9 today (via OpenCL)
  § oneAPI Data Analytics Library (oneDAL)
    • Classical machine learning algorithms
    • Easy-to-use one-line daal4py Python interfaces
    • Powers Scikit-Learn
  § Apache Spark MLlib

TRY IT NOW!

Current Testbeds

§ Intel's DevCloud
  § oneAPI toolkits: https://software.intel.com/ONEAPI

§ Argonne's Joint Laboratory for System Evaluation (JLSE) provides testbeds for Aurora – available today
  § 20 nodes of Intel Xeons with Gen 9 Iris Pro [no NDA required]
    • SYCL, OpenCL, Intel VTune and Advisor are available
  § Intel's Aurora SDK, which already includes many of the compilers, libraries, tools, and programming models planned for Aurora [NDA required, ECP/ESP]

§ If you have workloads that are of great interest to you, please send them to us!


QUESTIONS?

References

§ Intel Gen9 GPU Programmer's Reference Manual: https://en.wikichip.org/w/images/3/31/intel-gfx-prm-osrc-skl-vol03-gpu_overview.pdf
§ Intel Chief Architect keynote at the Intel HPC Developer Conference, Nov 17, 2019: https://s21.q4cdn.com/600692695/files/doc_presentations/2019/11/DEVCON-2019_16x9_v13_FINAL.pdf
§ DAOS:
  § DAOS Community Home: https://wiki.hpdd.intel.com/display/DC/DAOS+Community+Home
  § DAOS User Group meeting: https://wiki.hpdd.intel.com/display/DC/DUG19
§ Slingshot:
  § Hot Interconnects keynote by Steve Scott, CTO, Cray: http://www.hoti.org/hoti26/slides/sscott.pdf
  § HPC User Forum talk by Steve Scott, Sep 2019
  § https://www.nextplatform.com/2019/08/16/how-cray-makes-ethernet-suited-for-hpc-and-ai-with-slingshot/
