An Overview of Aurora, Argonne's Upcoming Exascale System
Argonne Leadership Computing Facility and Computational Science Division
ECP Annual Meeting, Houston, Feb. 3-7
Yasaman Ghadar, Tim Williams
www.anl.gov

Argonne staff who can help with your questions:

- Yasaman Ghadar – application porting; EXXALT, WDMApp
- Tim Williams – Early Science Program
- Kevin Harms – I/O libraries, DAOS, performance
- Sudheer Chunduri – network performance, Slingshot
- JaeHyuk Kwack – performance tools, Kokkos; E3SM-MMF
- Colleen Bertoni – application porting, OpenMP; GAMESS
- Thomas Applencourt – tools, SYCL/DPC++; QMCPACK
- Ray Loy – application porting, debuggers; ExaWind
- Brian Homerding – programming models, RAJA/Kokkos; EQSIM
- Kris Rowe – I/O libraries; ExaSMR
- James Osborn – LatticeQCD
- Saumil Patel – ExaSMR
- Victor Anisimov – programming models; NWChemEx
- Silvio Rizzi – ALPINE; ExaFEL
- Servesh Muralidharan – performance, hardware optimization/integration
- Kevin Brown – network performance

Outline

- Hardware architecture (Yasaman Ghadar)
  - Node-level hardware
  - Interconnect
  - I/O
  - Aurora's testbed
- Software stack (Tim Williams)
  - Three pillars
  - Programming models
  - Tools and libraries
  - Development platforms
  - Support for ECP teams
- Further details on Aurora can be discussed under NDA; if you are not covered by an Aurora NDA, please talk to your ECP project PI

Aurora: A High-Level View

- Intel-Cray machine arriving at Argonne in 2021
- Sustained performance > 1 exaflops
- Intel Xeon processors and Intel Xe GPUs
  - 2 Xeons (Sapphire Rapids)
  - 6 GPUs (Ponte Vecchio [PVC])
- Greater than 10 PB of total memory
- Cray Slingshot fabric and Shasta platform
- Filesystems
  - Distributed Asynchronous Object Store (DAOS): ≥ 230 PB of storage capacity, bandwidth > 25 TB/s
  - Lustre: 150 PB of storage capacity, bandwidth ~1 TB/s

The Evolution of Intel GPUs

[Figure: Intel GPU roadmap. Source: Intel]

Intel GPUs

- Recent integrated generations:
  - Gen9 – used with Skylake
  - Gen11 – used with Ice Lake
- Gen9 double-precision peak performance: 100-300 GFLOPS
  - Low by design, due to power and space limits
- The upcoming Xe GPU is discrete

[Figure: layout of architecture components for an Intel Core i7-6700K desktop processor (91 W TDP, 122 mm² die)]

Intel Discrete Graphics 1 (DG1)

- The Intel Xe DG1 software development vehicle (SDV) was shown at CES 2020
- The Xe DG1 GPU is part of the low-power (LP) series and is not a high-end product
- The chip will be used in laptops, with the GPU integrated onto the motherboard
- The PCIe form factor is only for the SDV, not a product
- The SDV is shipping worldwide to independent software vendors (ISVs) to enable broad software optimization for the Xe architecture

https://newsroom.intel.com/image-archive/images-intel-dgi-sdv/?wapkw=DG1#gs.u0a8t8

Aurora Compute Node

- 2 Intel Xeon (Sapphire Rapids) processors
- 6 Xe-architecture-based GPUs (Ponte Vecchio)
  - All-to-all connection
  - Low latency and high bandwidth
- 8 Slingshot fabric endpoints
- Unified memory architecture across CPUs and GPUs

Network Interconnect: Cray Shasta System

- Supports a diversity of processors
- Scale-optimized cabinets for density, cooling, and high network bandwidth
- Standard cabinets for flexibility
- Cray software stack with improved modularity
- Unified by a single, high-performance interconnect

Shasta Hardware Workshop, May 6, 2019, Cray User Meeting

Cray Slingshot Network

- Slingshot is Cray's next-generation scalable interconnect (8th major generation)
- Builds on Cray's expertise in high-performance networks, following:
  - Gemini (Titan, Blue Waters)
  - Aries (Theta, Cori) – 5-hop dragonfly topology
- Slingshot introduces:
  - Congestion management
  - Traffic classes
  - 3-hop dragonfly topology

https://www.cray.com/products/computing/slingshot
https://www.cray.com/resources/slingshot-interconnect-for-exascale-era

Dragonfly Topology

- Several groups are connected together using all-to-all links
- Optical cables are used for the long links between groups
- Low-cost electrical links connect the NICs in each node to their local router, and connect the routers within a group

[Figure: dragonfly topology – switches within each group connected all-to-all, nodes attached to the switches, Group 0 through Group n joined by global links]

https://www.cray.com/sites/default/files/resources/CrayXCNetwork.pdf

High-Bandwidth Switch: Rosetta

- Multi-level congestion management
  - Minimizes the impact of congested applications on others
  - Very low average and tail latency
- Quality of Service (QoS) traffic classes
  - A class is a collection of buffers, queues, and bandwidth
  - Intended to provide isolation between applications via traffic shaping
- Aggressive adaptive routing
  - Expected to be more effective for the 3-hop dragonfly, due to closer congestion information
- High-performance multicast and reductions
- 25.6 Tb/s per switch, from 64 × 200 Gb/s ports (25 GB/s per direction)

Congestion Management with Slingshot

- The GPCNeT benchmark suite was used

GPCNeT (Global Performance and Congestion Network Tests)

- A new benchmark developed by Cray in collaboration with Argonne and NERSC
- Publicly available at https://github.com/netbench/GPCNET
- Goals:
  - Serve as a proxy for real-world application communication patterns
  - Measure network performance under load
  - Look at both mean and tail latency
  - Look at interference between workloads
- What is congestion management?
  - How well the system performs under congestion

Congestion Management with Slingshot

- The GPCNeT benchmark suite was used
- Tail latency was measured in two settings: congested and isolated
- The benchmark was run over the full machine for multiple runs
- Congestion impact = (latency when congested) / (latency when isolated)
- Results:
  - Aries congestion impact is on the order of 1000x
  - InfiniBand has reduced congestion impact (20-1000x)
  - The Slingshot testbed (Malbec) has minimal congestion impact

[Chart: congestion impact (log scale), 99% tail latency and 99% all-reduce CI, for Crystal, Theta, Edison, Osprey, Summit, Sierra, and Malbec]

https://dl.acm.org/doi/pdf/10.1145/3295500.3356215

Slingshot Quality of Service Classes

- Highly tunable QoS classes
- Supports multiple, overlaid virtual networks
- Jobs can use multiple traffic classes
- Provides performance isolation for different types of traffic

[Figure: bandwidth sharing among three traffic types without QoS vs. with QoS traffic classes]

I/O: Distributed Asynchronous Object Store (DAOS)

- Open-source storage solution
- Offers high performance in bandwidth and I/O operations (≥ 230 PB capacity, ≥ 25 TB/s)
- Using DAOS is critical to achieving good I/O performance on Aurora
- Provides compatibility with existing I/O models such as POSIX, MPI-IO, and HDF5
- Provides a flexible storage API that enables new I/O paradigms

DAOS (Distributed Asynchronous Object Storage) for Applications; Thursday 8:30-10:00 AM
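Because DAOS preserves POSIX compatibility, existing file-based I/O should work without code changes. A minimal sketch: the function below is ordinary C++ file I/O, and the idea (an assumption for illustration, not an Aurora-specific API) is that pointing `dir` at a DAOS-backed POSIX namespace is all that would be required.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Write and read back a small checkpoint using plain POSIX-style file I/O.
// Nothing here is DAOS-specific: on a system exposing a DAOS container
// through a POSIX namespace (e.g. a link under /project, per the User
// Environment slide), `dir` would simply name that location instead.
bool roundtrip_checkpoint(const std::string& dir, const std::string& payload) {
    const std::string path = dir + "/checkpoint.dat";
    {
        std::ofstream out(path, std::ios::binary);
        if (!out) return false;          // directory missing or not writable
        out << payload;
    }
    std::ifstream in(path, std::ios::binary);
    if (!in) return false;
    std::string back((std::istreambuf_iterator<char>(in)),
                     std::istreambuf_iterator<char>());
    return back == payload;              // true when the round trip matches
}
```

The same source could also be rebuilt against MPI-IO or HDF5, the other compatibility paths the slide lists, without changing the data layout.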
User Environment

- Users see a single storage namespace, which lives in Lustre (compatibility with legacy systems)
  - Links within the /project directory point to DAOS containers
  - DAOS-aware software interprets these links and accesses the DAOS containers
- Data resides in a single place (Lustre or DAOS)
  - Explicit data movement; no auto-migration
- Suggested storage locations:
  - Source and binaries in Lustre
  - Bulk data in DAOS

Summary of Aurora Hardware Features

- Intel/Cray machine arriving at Argonne in 2021 with sustained performance > 1 exaflops
- Aurora node architecture:
  - 2 Intel Xeon (Sapphire Rapids) processors
  - 6 Xe-architecture-based GPUs (Ponte Vecchio)
  - 8 Slingshot fabric endpoints
  - Unified memory architecture across CPUs and GPUs
- Cray Slingshot network and Shasta platform
  - Highly tunable congestion management and traffic classes
- I/O
  - DAOS (≥ 230 PB capacity and ≥ 25 TB/s)
  - Lustre (150 PB of storage and ~1 TB/s)

Aurora Testbed

- JLSE provides testbeds for Aurora – available today
  - 20 nodes of Intel Xeons with Gen9 Iris Pro integrated GPUs
  - Intel's Aurora SDK, which already includes many of the compilers, libraries, tools, and programming models planned for Aurora [NDA required]:
    - OpenMP
    - DPC++/SYCL
    - OpenCL
    - Intel VTune and Advisor
- To develop your application on the integrated Gen9 GPU, request access by emailing [email protected]
  - Note: your home institution needs a CNDA agreement with Intel

Joint Laboratory for System Evaluation (JLSE): https://jlse.anl.gov
Intel Gen9 Integrated Graphics

- CPU and GPU are integrated on the same chip:
  - The GPU and CPU share the same memory
  - The GPU, CPU cores, and memory are connected by an internal ring interconnect
  - Shared last-level cache

IDF 2015 release of "The Compute Architecture of Intel Processor Graphics Gen9" – Stephen Junkins

Intel Gen9 Hierarchy

- GPU architectures have hardware hierarchies
- Built out of smaller scalable units; each level shares a set of resources
  - Slice → sub-slice → execution unit (EU)

Intel Gen9 Building Blocks: EU

- EU (execution unit): these are the compute processors that execute