Agenda

Start   Finish  Topic                                     Presenter
14:00   14:25   HPC processor landscape                   Andrea Bartolini, UNIBO
                - Main challenges for HPC processors
                - Architecture evolution towards heterogeneity
                - Semiconductor technology overview
                - European landscape and introduction to EuroHPC
14:25           European Processor Initiative (EPI)
14:25   14:35   - Overview                                Andrea Bartolini, UNIBO
14:35   14:50   - Processor and general architecture      Andrea Bartolini, UNIBO
14:50   15:00   - Co-design process and modeling          Andrea Bartolini, UNIBO
15:00   15:45   - Accelerator                             Mauro Olivieri, BSC
15:45   16:30   - Software                                Jesus Labarta, BSC / Jaume Abella, BSC
16:30   17:00   - Automotive                              Francisco Cazorla, BSC
17:00           End
HPC processor landscape (slides prepared by Denis Dutoit)

High Performance Computing pipeline: Compute -> Analyze, Data in -> Data out. New drivers, requirements and solutions:

New driver               Requirement                                              Solution
New workloads            More computing performance (ops per second), also for    Heterogeneity: generic processing + accelerators
                         simple operations (FP16, FP8, INT...)
                         Energy efficiency (ops per Watt)                         Low-power design
Massive volume of data   Increased Bytes per Flop; high-bandwidth / low-latency   High-Bandwidth Memories and 2.5D integration
                         access to all data
• Starting from high-performance compute only, HPC evolves towards:
  • New workloads
  • Massive volume of data

[Roadmap figure, supercomputer performance 2000-2033 (TERA1000 - CEA): performance grows ×10 every 4 years while energy per operation* drops ÷10 every 4 years, i.e. less than a 10× energy-efficiency improvement every 4 years. Milestones along the timeline: max. frequency reached, max. power reached, max. number of transistors reached.]

  Performance    Energy per operation*
  10 PFLOPS      2 nJ/FLOP
  100 PFLOPS     200 pJ/FLOP
  1 EFLOPS       20 pJ/FLOP
  10 EFLOPS      2 pJ/FLOP
  100 EFLOPS     0.2 pJ/FLOP

  * assuming 20 MWatt
[Era 1 - until max. frequency was reached: transistor count ↗↗↗, frequency ↗, power density →. Single-core architecture: CPU and cache on a bus, with memory and NIC (network interconnect).]
Source: Kunle Olukotun, Lance Hammond, Herb Sutter, Burton Smith, Chris Batten, and Krste Asanović
[Era 2 - max. power reached: transistor count ↗↗, frequency →, power density →. Multi-core architecture: several cores with private caches connected by a NoC with a last-level cache (LLC), plus NIC and memory.]
Source: Kunle Olukotun, Lance Hammond, Herb Sutter, Burton Smith, Chris Batten, and Krste Asanović
[Era 3 - today and next generation, max. number of transistors reached: transistor count ↗, frequency →, total power →. Heterogeneous architecture: generic processing plus HW accelerators, close memories next to the compute units, far memory and NIC reached over high-speed links.]
Source: Kunle Olukotun, Lance Hammond, Herb Sutter, Burton Smith, Chris Batten, and Krste Asanović

Pre-exascale and exascale systems worldwide:

Japan: K / RIKEN, 2011 (SPARC64 VIIIfx, 11.28 petaflops peak, 10.51 petaflops sustained) -> Fugaku / RIKEN, 2020-2021 (A64FX, Armv8.2+SVE, >0.5 exaflops)
China (2020-2021):
- Sunway TaihuLight / NRCPC (SW26010, 125.43 petaflops peak) -> NRCPC exa-prototype (SW26010-based, ?)
- Tianhe-2 / NUDT, 2013 (Intel Xeon + KNC, 33.86 petaflops peak, homogeneous) -> Tianhe-2a / NUDT, 2018 (Intel Xeon + Matrix-2000, 94.97 petaflops peak) -> Tianhe-3 / NUDT (Matrix-3000, >1.0 exaflops peak)
- Sugon exa-prototype (Hygon CPU + DCU, ?)

US (2021):
- Summit / ORNL, 2019 (IBM P9 + NVidia GPU, 200 petaflops peak, 148.6 petaflops sustained) -> Aurora / ANL (Intel Xeon + Xe, >1.0 exaflops peak)
- Sierra / LLNL, 2019 (IBM P9 + NVidia GPU, 125 petaflops peak) -> Frontier / ORNL (AMD CPU + GPU, ~1.5 exaflops peak)

The trend runs from homogeneous towards heterogeneous, accelerated designs with heterogeneous integration. EPI takes a 2-step approach:
- step #1: homogeneous, with Arm cores + SVE
- step #2: heterogeneous, with additional EPI accelerators

Sources: IBM, NVIDIA, Fujitsu HotChips 2018, ScienceDirect, https://www.amd.com/es/products/frontier
Transistor technology and process nodes (source: Wikipedia, WikiChip):

  Device:  Planar          FinFET                          Gate-all-around?
  Node:    32 nm   22 nm   14 nm   10 nm   7 nm    5 nm    3 nm
  Year:    2010    2012    2014    2016    2018    2019    ~2021

Foundries active across these nodes include Panasonic, STM, UMC, IBM, GlobalFoundries (GF), Samsung, TSMC and Intel; the 3 nm generation is still under announcement.
Advanced integration - chiplets and System-in-Package (SiP):
- Multi-Chip-Module: interconnect density ~100 µm x 100 µm
- 2.5D partitioning: interconnect density ~10 µm x 10 µm
- 3D die stacking: interconnect density ~10 µm x 10 µm
(Source: DARPA)

Examples:
- High-Bandwidth Memory die stacks (sources: Micron, LETI)
- System-in-Package and 3D Integrated Circuits (3D IC) (source: GeorgiaTech)
- AMD chiplet evolution (source: AMD): (2017.10) EPYC 7260, 4-chiplet chip; (2018.11) EPYC "Rome", 9-chiplet chip, AMD Zen2 architecture
How to bring Europe back into the processor race? For today's leading HPC chips, both design and manufacturing sit outside Europe:
- IBM POWER9
- NVIDIA Volta GV100
- Sunway SW26010
- Intel Xeon E5
European Processor Initiative (EPI)
EUROPEAN PROCESSOR INITIATIVE - objectives:
* Pre-exascale level with a general-purpose CPU core in the first EPI GPP chip
* Adopt an Arm general-purpose CPU core with SVE vector acceleration in the first EPI chip
* Develop acceleration technologies for better DP GFLOPS/Watt performance
* Include an MPPA for real-time application acceleration
* Develop a Common Platform to enable the EPI accelerations
* Supply sufficient memory bandwidth (Byte/FLOP) to support GPP applications
* In SGA1, focus on programming models to include the accelerations

Streams:
S1 - Common: codesign, architecture, system software and key technologies for the Common Platform
S2 - GPP Processor: design and implement the processor chip(s) and PoC system
S3 - Acceleration: foster acceleration technologies and create building blocks
S4 - Automotive: address automotive market needs and create a pilot eHPC system
S5 - Administration: manage and support activities

[GPP chip floorplan: ARM cores, an MPPA, an eFPGA and EPAC accelerators on the die, with HBM memories, DDR memories, PCIe gen5 links, HSL links, and D2D links to adjacent chiplets.]
[EPAC accelerator detail: VPU, STX and VRP units, each connected through a bridge to the GPP.]
[EPI GPP package: interposer/package integration of the silicon die together with a power management & security controller. On the die: Armv8 cores with attached accelerators (EPAC, eFPGA blocks, MPPA), PCIe ports or high-speed links, integrated HBM2E stacks, DDR5/4 channels, and die-to-die links, all connected by a memory-coherent network-on-chip in a 2D-mesh topology with a distributed system-level cache.]
Architecture notes and glossary:
- NoC: network-on-chip; HSL: high-speed links (with memory-coherence support)
- The power-management infrastructure and the interrupt network span the chip
- Acceleration blocks attach to the NoC through AXI slave and master ports, next to the Armv8 CPU cores with SVE
- Datasets can be shared by acceleration blocks through the NoC's SLC or in external memory
- CU: computing unit; either an Armv8 core with SVE or an EPAC/MPPA acceleration block
- SLC: system-level cache; a last-level cache in front of the external memories (HBM or DDR)
Power management:
- Out-of-band: zero overhead
- In-band: low-latency PM requests
- Power cap, a.k.a. max performance at a given power P

[Node power-management stack: the operating system issues in-band requests via governors towards the GPP and its VRMs (DIMM, RAS, node power cap); out-of-band, the system/resource management path reaches the chip's power controller through a BMC (RJ45).]

Vendor comparison (source: PowerStack19; domains: Socket (S), Core (C), Memory (M), Accelerator (G), Node (N), Utilization (U), Temperature (T)):

Monitor (domain; granularity):
- Intel: S, M, A, T; 1 ms
- IBM: N, S, M, A, T, U; 500 µs, 10 ms aggregation
- ARM: S, M, T; 1-10 kHz with SCP (100 ms)
- AMD: N, S, M, A, T; 1 s (C), ~ns model-based (C)
- Cray: N, S, M, A; OOB 1 ms (N), 16 ms for T & U, 100 ms aggregation
- Fujitsu: N, S, C, M; 1 ms (G)

Control (domain; granularity):
- Intel: S, M; RAPL 1 ms (in-band), DVFS 500 µs
- IBM: N, S, M, A; 10-100 ms
- ARM: S, M; 1-10 kHz (100 ms to 1 s)
- AMD: N, S, M, A; ~seconds
- Cray: N, S, M, A; DVFS, RAPL, min-max range, 10-30 s at job launch
- Fujitsu: S, C, M; DVFS, decode width, HBM2 bandwidth

Interfaces / tools:
- Intel: RAPL MSRs, msr-safe, libmsr, PAPI, memory map
- IBM: OpenBMC, amester, PAPI, memory map
- ARM: ACPI, SCP (system control processor), BMC interfaces, IPA (intelligent power allocator)
- AMD: Likwid, PAPI
- Cray: CapMC, Cray PAPI
- Fujitsu: Power API, PAPI

EPI power management design is powered by UNIBO and targets:
- Support for fine-grain power monitoring and control
- A higher-performance power controller capable of supporting advanced power-control algorithms

Security infrastructure: a Root of Trust (RoT), security services and advanced cryptos, with the CUs partitioned into security domains (#1, #2, ...).

Co-design process and modeling: architects and application experts collect requirements from streams S2 (GPP Processor), S3 (Acceleration) and S4 (Automotive); benchmarks (full applications, mini-apps, tests) run on simulators, emulators and evaluation platforms; models produce evaluation results against selection criteria, showing the impact of design parameters on application performance, which feeds back, together with technical constraints, into the processor design loop.

CREDITS:
- P. Petrakis, V. Papaefstathiou et al.
(FORTH): simulation execution and analysis
- B. Brank, S. Nassyr (FZJ): BLIS micro-kernel
- A. Portero (FZJ): Gem5 simulator setup

[Tracing flow: the application binary is instrumented via an OpenMP runtime-call plugin and MPI instrumentation; task/chunk creation events, dependencies, dynamic instruction counts and MPI calls are collected into a trace.]