AN OVERVIEW OF AURORA, ARGONNE’S UPCOMING EXASCALE SYSTEM

ALCF DEVELOPERS SESSION

COLLEEN BERTONI, SUDHEER CHUNDURI

AURORA: An Intel-Cray System

An Intel/Cray machine arriving at Argonne in 2021, with sustained performance greater than 1 exaflops.

AURORA: A High-level View

§ Hardware Architecture:
  § Intel processors and GPUs
  § Greater than 10 PB of total memory
  § Cray Slingshot network and Shasta platform
  § IO
    • Uses Lustre and the Distributed Asynchronous Object Store (DAOS)
    • Greater than 230 PB of storage capacity and 25 TB/s of bandwidth

§ Software (Intel oneAPI umbrella):
  § Intel compilers (C, C++, Fortran)
  § Programming models: DPC++, OpenMP, OpenCL
  § Libraries: oneMKL, oneDNN, oneDAL
  § Tools: VTune, Advisor
  § Python

NODE-LEVEL HARDWARE

The Evolution of Intel GPUs

[Figure: roadmap of Intel GPU generations. Source: Intel]

Intel GPUs

§ Intel integrated GPUs have been used for over a decade in:
  § Laptops (e.g. MacBook Pro)
  § Desktops
  § Servers
§ Recent and upcoming integrated generations:
  § Gen 9 – used in Skylake-based nodes
  § Gen 11 – used in Ice Lake-based nodes
§ Gen 9 double precision peak performance: 100-300 GF
  § Low by design due to power and space limits
§ The future Intel Xe (Gen 12) GPU series will provide both integrated and discrete GPUs

[Figure: layout of architecture components for a desktop i7-6700K processor (91 W TDP, 122 mm² die)]

Intel GPU Building Blocks

[Diagram: Gen 9 GPU building blocks]
§ EU (Execution Unit): two SIMD FPUs, branch and send units, thread arbiter, instruction fetch/dispatch and instruction cache
§ Subslice: 8 EUs, plus sampler, dataport, L1 and L2 caches
§ Slice: 24 EUs (3 subslices), L3 data cache (512 KB/slice), shared local memory (64 KB/subslice)

Gen 9 Iris (GT4) GPU Characteristics

Characteristics                 Value               Notes
Clock                           1.15 GHz
Slices                          3
EUs                             72                  3 slices * 3 sub-slices * 8 EUs
Hardware Threads                504                 72 EUs * 7 threads
Concurrent Kernel Instances     16,128              504 threads * SIMD-32
L3 Data Cache Size              1.5 MB              3 slices * 0.5 MB/slice
Max Shared Local Memory         576 KB              3 slices * 3 sub-slices * 64 KB/sub-slice
Last Level Cache Size           8 MB
eDRAM Size                      128 MB
32b float FLOPS                 1152 FLOPS/cycle    72 EUs * 2 FPUs * SIMD-4 * (MUL + ADD)
64b float FLOPS                 288 FLOPS/cycle     72 EUs * 1 FPU * SIMD-2 * (MUL + ADD); 331.2 DP GFlops
32b integer IOPS                576 IOPS/cycle      72 EUs * 2 FPUs * SIMD-4
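As a quick cross-check of the 331.2 DP GFlops figure above (the single precision number is derived here, not stated on the slide), the per-cycle rates combine with the 1.15 GHz clock as:

288 FLOP/cycle × 1.15 GHz = 331.2 DP GFLOP/s
1152 FLOP/cycle × 1.15 GHz = 1324.8 SP GFLOP/s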

Aurora Compute Node

§ 2 Intel Xeon processors
§ 6 Xe architecture based GPUs (Ponte Vecchio)
§ 8 Slingshot fabric endpoints
§ Unified memory architecture across CPU & GPU
§ All-to-all connectivity within the node
§ Low latency and high bandwidth

[Diagram: node layout with two CPUs (each with DRAM), six Xe GPUs, and Slingshot endpoints]

NETWORK INTERCONNECT

Cray Slingshot Network

§ Dragonfly topology:
  § 3-hop network topology
  § Groups connected globally all-to-all with optical cables; flattened butterfly within a group
§ High bandwidth switch:
  § 64 ports at 200 Gb/s in each direction
  § Total bandwidth of 25.6 Tb/s per switch

[Diagram: groups of switches with all-to-all global links; nodes attached to the switches within each group]

Rosetta Switch HPC Features

§ QoS – traffic classes
  § Class: collection of buffers, queues and bandwidth
  § Intended to provide isolation between applications via traffic shaping
§ Aggressive adaptive routing
  § Expected to be more effective for the 3-hop dragonfly due to closer congestion information
§ Multi-level congestion management
  § To minimize the impact of congested applications on others
§ Very low average and tail latency
§ High performance multicast and reductions

[Diagram: Rosetta switch, 64 ports x 200 Gbps per direction]

Slingshot Quality of Service Classes

§ Highly tunable QoS classes
  § Priority, ordering, routing biases, minimum bandwidth guarantees, maximum bandwidth constraints, etc.
§ Supports multiple, overlaid virtual networks
  § High priority compute
  § Standard compute
  § Low-latency control & synchronization
  § Bulk I/O
  § Scavenger class background
§ Jobs can use multiple traffic classes
§ Provides performance isolation for different types of traffic
  § Smaller-size reductions can avoid getting stuck behind large messages
  § Less interference between compute and I/O

[Diagram: bandwidth sharing among three traffic types without QoS vs. with QoS classes]

Congestion Management

§ Hardware automatically tracks all outstanding packets
  § Knows what is flowing between every pair of endpoints
§ Quickly identifies and controls causes of congestion
  § Pushes back on sources (injection points) just the right amount
  § Frees up buffer space for other traffic
  § Other traffic is not affected and can pass stalled traffic
  § Avoids head-of-line (HOL) blocking across the entire fabric
§ Fast and stable across a wide variety of traffic patterns
  § Suitable for dynamic HPC traffic
§ Performance isolation between applications in the same QoS class
  § Applications are much less impacted by other traffic on the network – predictable runtimes
  § Lower mean and tail latency – a big benefit for apps with global synchronization

GPCNeT (Global Performance and Congestion Network Tests)

§ A new benchmark developed by Cray in collaboration with Argonne and NERSC
§ Publicly available at https://github.com/netbench/GPCNET

§ Goals:
  § Proxy for real-world application communication patterns
  § Measure network performance under load
  § Look at both mean and tail latency
  § Look at interference between workloads
  § How well does the network's congestion management perform?

Congestion Impact in Real Systems

[Charts: measured congestion impact on several production systems]

§ InfiniBand does better than Aries
§ Impact worsens with scale and taper
§ Slingshot does really well

MPI on Aurora

§ Intel MPI & Cray MPI – MPI 3.0 standard compliant
§ The MPI library will be thread safe and allow applications to use MPI from individual threads
  § Efficient MPI_THREAD_MULTIPLE (locking optimizations)
§ Asynchronous progress in all types of nonblocking communication
  § Nonblocking send-receive
  § Nonblocking collectives
  § One-sided
§ Hardware optimized collective implementations
§ Topology optimized collective implementations
§ Supports the MPI tools interface
  § Control variables
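A minimal sketch (not from the deck; do_local_work() is a placeholder) of requesting MPI_THREAD_MULTIPLE and overlapping a nonblocking collective with computation:

#include <mpi.h>
#include <cstdio>
#include <vector>

// Placeholder for application work that can overlap communication.
static void do_local_work() { /* ... */ }

int main(int argc, char **argv) {
    // Ask for full multi-threaded MPI support; check what was granted.
    int provided = 0;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
        std::printf("MPI_THREAD_MULTIPLE not available (got %d)\n", provided);

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Nonblocking collective: start the allreduce, compute, then wait.
    std::vector<double> local(1024, rank), global(1024, 0.0);
    MPI_Request req;
    MPI_Iallreduce(local.data(), global.data(), 1024, MPI_DOUBLE,
                   MPI_SUM, MPI_COMM_WORLD, &req);

    do_local_work();   // overlaps with asynchronous progress

    MPI_Wait(&req, MPI_STATUS_IGNORE);
    MPI_Finalize();
    return 0;
}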

IO

Aurora IO Subsystem Overview

§ Two-tiered approach to storage
  § DAOS provides high-performance storage for compute datasets and checkpoints
  § Lustre provides a traditional parallel file system
§ Distributed Asynchronous Object Store (DAOS)
  § Provides the main storage system for Aurora
    • Greater than 230 PB of storage capacity
    • Greater than 25 TB/s of bandwidth
  § New storage paradigm that provides significant performance improvements
    • Enables new data-centric workflows
    • Provides a POSIX wrapper with relaxed POSIX semantics
§ Lustre
  § Traditional PFS
  § Provides the global namespace for the center
  § Compatibility with legacy systems
  § Storage for source code, binaries, and libraries
  § Storage of "cold" project data

[Diagram: compute nodes and login/service nodes connect over Slingshot to DAOS NVM storage nodes and, via gateway nodes and InfiniBand, to the Lustre and GPFS HDD tiers (pre-installed storage for libs, binaries, and the namespace); datasets move between DAOS and Lustre/GPFS]

I/O – User Perspective

§ Users see a single namespace, which is in Lustre
  § "Links" point to DAOS containers within the /project directory
  § DAOS-aware software interprets these links and accesses the DAOS containers
§ Data resides in a single place (Lustre or DAOS)
  § Explicit data movement, no auto-migration
§ Users keep
  § Source and binaries in Lustre
  § Bulk data in DAOS

[Diagram: a single namespace rooted in Lustre (/mnt with lustre and daos subtrees; regular PFS directories and files such as users, projects, and libs with mkl.so and hdf5.so); link files carry extended attributes that point into DAOS pools and containers, e.g. EA: daos/hdf5://PUUID1/CUUID1 (HDF5 container), EA: daos/POSIX://PUUID2/CUUID2 (POSIX container), EA: daos/MPI-IO://PUUID2/CUUID3 (MPI-IO container)]

SOFTWARE

Pre-exascale and Exascale US Landscape

System        Delivery    CPU + Accelerator Vendor
Summit        2018        IBM + NVIDIA
Sierra        2018        IBM + NVIDIA
Perlmutter    2020        AMD + NVIDIA
Aurora        2021        Intel + Intel
Frontier      2021        AMD + AMD
El Capitan    2022        AMD + X

§ Heterogeneous computing (CPU + accelerator)
§ Varying vendors

Three Pillars

Simulation                  Data                        Learning
HPC Languages               Productivity Languages      Productivity Languages
Directives                  Big Data Stack              DL Frameworks
Parallel Runtimes           Statistical Libraries       Statistical Libraries
Solver Libraries            Databases                   Linear Algebra Libraries

Common foundation: Compilers, Performance Tools, Debuggers; Math Libraries, C++ Standard Library, libc; I/O, Messaging; Containers, Visualization; Scheduler; Linux Kernel, POSIX

PROGRAMMING MODELS

Heterogeneous System Programming Models

§ Applications will be using a variety of programming models for exascale:
  § CUDA
  § OpenCL
  § HIP
  § OpenACC
  § OpenMP
  § DPC++/SYCL
  § Kokkos
  § Raja
§ Not all systems will support all models
§ Libraries may help you abstract some programming models

Aurora Programming Models

§ Programming models available on Aurora:
  § OpenCL
  § OpenMP
  § DPC++/SYCL
  § Kokkos
  § Raja

Mapping of Existing Programming Models to Aurora

Existing model          →   Aurora model
OpenMP w/o target       →   OpenMP
OpenMP with target      →   OpenMP
OpenACC                 →   OpenMP
OpenCL                  →   OpenCL
MPI + CUDA              →   DPC++/SYCL
Kokkos                  →   Kokkos
Raja                    →   Raja

Vendor supported programming models: OpenMP, OpenCL, DPC++/SYCL
ECP provided programming models: Kokkos, Raja

OpenMP 5

§ OpenMP 5 constructs will provide a directive-based programming model for Intel GPUs
§ Available for C, C++, and Fortran
§ A portable model expected to be supported on a variety of platforms (Aurora, Frontier, Perlmutter, …)
§ Optimized for Aurora
§ For Aurora, OpenACC codes could be converted into OpenMP (see the sketch below)
  § ALCF staff will assist with conversion, training, and best practices
  § Automated translation is possible through the clacc conversion tool (for C/C++)

https://www.openmp.org/
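As a rough illustration of the OpenACC-to-OpenMP conversion mentioned above (arrays a, b and size n are made up for this sketch; it is not taken from the deck), a simple parallel loop might translate as:

// OpenACC version
#pragma acc parallel loop copyin(a[0:n]) copyout(b[0:n])
for (int i = 0; i < n; ++i)
    b[i] = 2.0f * a[i];

// Equivalent OpenMP 5 offload version
#pragma omp target teams distribute parallel for map(to: a[0:n]) map(from: b[0:n])
for (int i = 0; i < n; ++i)
    b[i] = 2.0f * a[i];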

OpenMP 5 Example: Host/Device Offload

extern void init(float*, float*, int);
extern void output(float*, int);

void vec_mult(float *p, float *v1, float *v2, int N)
{
    int i;
    init(v1, v2, N);
    // target: transfers data and execution control to the device
    // teams: creates teams of threads on the device
    // distribute parallel for simd: distributes iterations to the threads,
    // where each thread uses SIMD parallelism
    #pragma omp target teams distribute parallel for simd \
                map(to: v1[0:N], v2[0:N]) map(from: p[0:N])
    for (i = 0; i < N; i++)
        p[i] = v1[i] * v2[i];
    output(p, N);
}

OpenMP 4.5/5 for Aurora

§ The OpenMP 4.5/5 specification has significant updates to allow for improved support of accelerator devices

Offloading code to run on accelerator:
  #pragma omp target [clause[[,] clause],…]
      structured-block
  #pragma omp declare target
      declarations-definition-seq
  #pragma omp declare variant*(variant-func-id) clause new-line
      function definition or declaration

Distributing iterations of the loop to threads:
  #pragma omp teams [clause[[,] clause],…]
      structured-block
  #pragma omp distribute [clause[[,] clause],…]
      for-loops
  #pragma omp loop* [clause[[,] clause],…]
      for-loops

Controlling data transfer between devices:
  map([map-type:] list)
      map-type := alloc | tofrom | from | to
  #pragma omp target data [clause[[,] clause],…]
      structured-block
  #pragma omp target update [clause[[,] clause],…]

Runtime support routines:
  • void omp_set_default_device(int dev_num)
  • int omp_get_default_device(void)
  • int omp_get_num_devices(void)
  • int omp_get_num_teams(void)

Environment variables:
  • Control default device through OMP_DEFAULT_DEVICE
  • Control offload with OMP_TARGET_OFFLOAD

* denotes OpenMP 5
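A small sketch (illustrative only; the array name and sizes are made up) of how the data-transfer constructs above combine, keeping an array resident on the device across two kernels and refreshing the host copy in between:

void update_example(float *a, int n)
{
    // Keep 'a' on the device for the whole region.
    #pragma omp target data map(tofrom: a[0:n])
    {
        #pragma omp target teams distribute parallel for
        for (int i = 0; i < n; ++i)
            a[i] *= 2.0f;

        // Copy the current device values back to the host mid-region.
        #pragma omp target update from(a[0:n])

        #pragma omp target teams distribute parallel for
        for (int i = 0; i < n; ++i)
            a[i] += 1.0f;
    }   // 'a' is copied back to the host at the end of the target data region
}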

DPC++ (Data Parallel C++) and SYCL

§ SYCL
  • Khronos standard specification
  • SYCL is a C++ based abstraction layer (standard C++11 or later)
  • Builds on OpenCL concepts (but single-source)
  • SYCL is designed to be as close to standard C++ as possible
§ Current implementations of SYCL:
  • ComputeCpp™ (www.codeplay.com)
  • Intel SYCL (github.com/intel/llvm)
  • triSYCL (github.com/triSYCL/triSYCL)
  • hipSYCL (github.com/illuhad/hipSYCL)
  • Runs on today's CPUs and NVIDIA, AMD, and Intel GPUs
§ DPC++
  § Part of the Intel oneAPI specification
  § Intel extension of SYCL to support new innovative features
  § Incorporates the SYCL 1.2.1 specification and Unified Shared Memory
  § Adds language or runtime extensions as needed to meet user needs

DPC++ (Data Parallel C++) and SYCL

Intel DPC++ extensions (see https://spec.oneapi.com/oneAPI/Elements/dpcpp/dpcpp_root.html#extensions-table):

Extension                     Description
Unified Shared Memory (USM)   Defines pointer-based memory accesses and management interfaces
In-order queues               Defines simple in-order semantics for queues, to simplify common coding patterns
Reduction                     Provides a reduction abstraction for the ND-range form of parallel_for
Optional lambda name          Removes the requirement to manually name lambdas that define kernels
Subgroups                     Defines a grouping of work-items within a work-group
Data flow pipes               Enables efficient First-In, First-Out (FIFO) communication (FPGA-only)

SYCL Example: Host/Device Offload

void foo(int *A) {
    // Get a device: selectors determine which device to dispatch to.
    default_selector selector;
    {
        // Create a queue to submit work to, based on the selector.
        queue myQueue(selector);

        // Wrap the host pointer in a sycl::buffer.
        buffer<int, 1> bufferA(A, range<1>(1024));

        myQueue.submit([&](handler& cgh) {
            // Create an accessor for the SYCL buffer.
            auto writeResult = bufferA.get_access<access::mode::write>(cgh);

            // Kernel
            cgh.parallel_for(range<1>{1024}, [=](id<1> idx) {
                writeResult[idx] = idx[0];
            }); // End of the kernel function
        }); // End of the queue commands
    } // End of scope; wait for the queued work to complete.
    /* ... */
}
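The Unified Shared Memory extension listed above allows a pointer-based style instead of buffers and accessors. A minimal sketch, assuming DPC++/SYCL 2020-style USM APIs (malloc_shared, the queue parallel_for shortcut); illustrative only, not taken from the deck:

#include <CL/sycl.hpp>
using namespace sycl;

int main() {
    queue q{default_selector{}};

    // Allocate memory accessible from both host and device.
    constexpr size_t N = 1024;
    int *data = malloc_shared<int>(N, q);

    // Launch a kernel that writes directly through the shared pointer.
    q.parallel_for(range<1>{N}, [=](id<1> idx) {
        data[idx] = static_cast<int>(idx[0]);
    }).wait();

    // The host can read the results without explicit copies.
    int last = data[N - 1];

    free(data, q);
    return last == static_cast<int>(N - 1) ? 0 : 1;
}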

OpenCL

§ Open standard for heterogeneous device programming (CPU, GPU, FPGA)
  § Utilized for GPU programming
§ Standardized by the multi-vendor Khronos Group; V1.0 released in 2009
  § AMD, Intel, NVIDIA, …
§ Many implementations from different vendors
  § The Intel implementation for GPUs is open source (https://github.com/intel/compute-runtime)
§ SIMT programming model
  § Distributes work by abstracting loops and assigning work to threads
  § Does not use pragmas / directives for specifying parallelism
  § Similar model to CUDA
§ Consists of a C-compliant library with a kernel language
  § Kernel language extends C
  § Has extensions that may be vendor specific
§ Programming aspects:
  § Requires host and device code to be in different files
  § Typically uses JIT compilation
  § Example: https://github.com/alcf-perfengr/alcl
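To illustrate the host/kernel split and JIT compilation described above, a minimal sketch using standard OpenCL C APIs (error handling omitted; the first platform and default device are taken blindly; not from the deck):

#include <CL/cl.h>
#include <cstdio>
#include <vector>

// Device code: kept as a source string and JIT-compiled at runtime.
static const char *kSrc =
    "__kernel void scale(__global float *x, float a) {\n"
    "    size_t i = get_global_id(0);\n"
    "    x[i] = a * x[i];\n"
    "}\n";

int main() {
    const size_t n = 1024;
    std::vector<float> host(n, 1.0f);

    // Pick the first platform and its default device.
    cl_platform_id platform; cl_device_id device;
    clGetPlatformIDs(1, &platform, nullptr);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, nullptr);

    cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, nullptr);
    cl_command_queue queue = clCreateCommandQueueWithProperties(ctx, device, nullptr, nullptr);

    // JIT-compile the kernel source for the selected device.
    cl_program prog = clCreateProgramWithSource(ctx, 1, &kSrc, nullptr, nullptr);
    clBuildProgram(prog, 1, &device, nullptr, nullptr, nullptr);
    cl_kernel kernel = clCreateKernel(prog, "scale", nullptr);

    // Copy data to the device, set arguments, and launch over n work-items.
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                n * sizeof(float), host.data(), nullptr);
    float a = 2.0f;
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
    clSetKernelArg(kernel, 1, sizeof(float), &a);
    clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &n, nullptr, 0, nullptr, nullptr);

    // Read the result back and clean up.
    clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, n * sizeof(float), host.data(), 0, nullptr, nullptr);
    std::printf("host[0] = %f\n", host[0]);

    clReleaseMemObject(buf); clReleaseKernel(kernel); clReleaseProgram(prog);
    clReleaseCommandQueue(queue); clReleaseContext(ctx);
    return 0;
}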

oneAPI

§ Industry specification from Intel (https://www.oneapi.com/spec/)
  § Language and libraries to target programming across diverse architectures (DPC++, APIs, low-level interface)

§ Intel oneAPI products and toolkits (https://software.intel.com/ONEAPI)
  § Implementations of the oneAPI specification, plus analysis and debug tools to help with programming

TOOLS AND LIBRARIES

VTune and Advisor

§ VTune Profiler
  § Widely used performance analysis tool
  § Currently supports analysis on Intel integrated GPUs
  § Will support future Intel GPUs
§ Advisor
  § Provides roofline analysis
  § Offload analysis will identify components that are profitable to offload
    • Measures performance and behavior of the original code
    • Models specific accelerator performance to determine offload opportunities
    • Considers overhead from data transfer and kernel launch

Intel MKL

§ Highly tuned algorithms
§ Optimized for every Intel compute platform
§ oneAPI MKL (oneMKL)

AI and Analytics

§ Libraries to support AI and analytics
  § oneAPI Deep Neural Network Library (oneDNN)
    • High-performance primitives to accelerate deep learning frameworks
    • Powers TensorFlow, PyTorch, MXNet, Intel Caffe, and more
    • Running on Gen 9 today (via OpenCL)
  § oneAPI Data Analytics Library (oneDAL)
    • Classical machine learning algorithms
    • Easy-to-use one-line daal4py Python interfaces
    • Powers Scikit-Learn
  § Apache Spark MLlib

TRY IT NOW!

Current Testbeds

§ Intel's DevCloud
  § oneAPI toolkits: https://software.intel.com/ONEAPI

§ Argonne's Joint Laboratory for System Evaluation (JLSE) provides testbeds for Aurora – available today
  § 20 nodes of Intel Xeons with Gen 9 Iris Pro [no NDA required]
    • SYCL, OpenCL, Intel VTune and Advisor are available
  § Intel's Aurora SDK, which already includes many of the compilers, libraries, tools, and programming models planned for Aurora [NDA required, ECP/ESP]

§ If you have workloads that are of great interest to you, please send them to us!


QUESTIONS?

References

§ Intel Gen9 GPU Programmer's Reference Manual: https://en.wikichip.org/w/images/3/31/intel-gfx-prm-osrc-skl-vol03-gpu_overview.pdf
§ Intel Chief Architect keynote at the Intel HPC Developer Conference, Nov 17, 2019: https://s21.q4cdn.com/600692695/files/doc_presentations/2019/11/DEVCON-2019_16x9_v13_FINAL.pdf
§ DAOS:
  § DAOS Community Home: https://wiki.hpdd.intel.com/display/DC/DAOS+Community+Home
  § DAOS User Group meeting: https://wiki.hpdd.intel.com/display/DC/DUG19
§ Slingshot:
  § Hot Interconnects keynote by Steve Scott, CTO, Cray: http://www.hoti.org/hoti26/slides/sscott.pdf
  § HPC User Forum talk by Steve Scott, Sep 2019
  § https://www.nextplatform.com/2019/08/16/how-cray-makes-ethernet-suited-for-hpc-and-ai-with-slingshot/
