Agenda

Start   Finish  Topic                                     Presenter
14:00   14:25   HPC processor landscape                   Andrea Bartolini, UNIBO
                - Main challenges for HPC processors
                - Architecture evolution towards heterogeneity
                - Semiconductor technology overview
                - European landscape and introduction to EuroHPC
14:25           European Processor Initiative (EPI)
14:25   14:35   - Overview                                Andrea Bartolini, UNIBO
14:35   14:50   - Processor and general architecture      Andrea Bartolini, UNIBO
14:50   15:00   - Co-design process and modeling          Andrea Bartolini, UNIBO
15:00   15:45   - Accelerator                             Mauro Olivieri, BSC
15:45   16:30   - Software                                Jesus Labarta, BSC / Jaume Abella, BSC
16:30   17:00   - Automotive                              Francisco Cazorla, BSC
17:00           End
HPC processor landscape (slides prepared by Denis Dutoit)

High Performance Computing pipeline: Compute -> Analyze, Data in -> Data out. New drivers, requirements and solutions:

New driver               Requirement                                              Solution
New workloads            More computing performance (ops per second), also for    Heterogeneity: generic processing + accelerators
                         simple operations (FP16, FP8, INT...)
                         Energy efficiency (ops per Watt)                         Low-power design
Massive volume of data   Increased Bytes per Flop; high-bandwidth / low-latency   High-Bandwidth Memories and 2.5D integration
                         access to all data
• Starting from high-performance compute only, HPC evolves towards:
  • New workloads
  • Massive volume of data

[Roadmap figure, supercomputer performance 2000-2033 (TERA1000 - CEA): performance grows ×10 every 4 years while energy per operation* drops ÷10 every 4 years, i.e. less than a 10× energy-efficiency improvement every 4 years. Milestones along the timeline: max. frequency reached, max. power reached, max. number of transistors reached.]

  Performance    Energy per operation*
  10 PFLOPS      2 nJ/FLOP
  100 PFLOPS     200 pJ/FLOP
  1 EFLOPS       20 pJ/FLOP
  10 EFLOPS      2 pJ/FLOP
  100 EFLOPS     0.2 pJ/FLOP

  * assuming 20 MWatt
[Era 1 - until max. frequency was reached: transistor count ↗↗↗, frequency ↗, power density →. Single-core architecture: CPU and cache on a bus, with memory and NIC (network interconnect).]
Source: Kunle Olukotun, Lance Hammond, Herb Sutter, Burton Smith, Chris Batten, and Krste Asanović
[Era 2 - max. power reached: transistor count ↗↗, frequency →, power density →. Multi-core architecture: several cores with private caches connected by a NoC with a last-level cache (LLC), plus NIC and memory.]
Source: Kunle Olukotun, Lance Hammond, Herb Sutter, Burton Smith, Chris Batten, and Krste Asanović
[Era 3 - today and next generation, max. number of transistors reached: transistor count ↗, frequency →, total power →. Heterogeneous architecture: generic processing plus HW accelerators, close memories next to the compute units, far memory and NIC reached over high-speed links.]
Source: Kunle Olukotun, Lance Hammond, Herb Sutter, Burton Smith, Chris Batten, and Krste Asanović

Pre-exascale and exascale systems worldwide:

Japan: K / RIKEN, 2011 (SPARC64 VIIIfx, 11.28 petaflops peak, 10.51 petaflops sustained) -> Fugaku / RIKEN, 2020-2021 (A64FX, Armv8.2+SVE, >0.5 exaflops)
China (2020-2021):
- Sunway TaihuLight / NRCPC (SW26010, 125.43 petaflops peak) -> NRCPC exa-prototype (SW26010-based, ?)
- Tianhe-2 / NUDT, 2013 (Intel Xeon + KNC, 33.86 petaflops peak, homogeneous) -> Tianhe-2a / NUDT, 2018 (Intel Xeon + Matrix-2000, 94.97 petaflops peak) -> Tianhe-3 / NUDT (Matrix-3000, >1.0 exaflops peak)
- Sugon exa-prototype (Hygon CPU + DCU, ?)

US (2021):
- Summit / ORNL, 2019 (IBM P9 + NVidia GPU, 200 petaflops peak, 148.6 petaflops sustained) -> Aurora / ANL (Intel Xeon + Xe, >1.0 exaflops peak)
- Sierra / LLNL, 2019 (IBM P9 + NVidia GPU, 125 petaflops peak) -> Frontier / ORNL (AMD CPU + GPU, ~1.5 exaflops peak)

The trend runs from homogeneous towards heterogeneous, accelerated designs with heterogeneous integration. EPI takes a 2-step approach:
- step #1: homogeneous, with Arm cores + SVE
- step #2: heterogeneous, with additional EPI accelerators

Sources: IBM, NVIDIA, Fujitsu HotChips 2018, ScienceDirect, https://www.amd.com/es/products/frontier
Transistor technology and process nodes (source: Wikipedia, WikiChip):

  Device:  Planar          FinFET                          Gate-all-around?
  Node:    32 nm   22 nm   14 nm   10 nm   7 nm    5 nm    3 nm
  Year:    2010    2012    2014    2016    2018    2019    ~2021

Foundries active across these nodes include Panasonic, STM, UMC, IBM, GlobalFoundries (GF), Samsung, TSMC and Intel; the 3 nm generation is still under announcement.
Advanced integration - chiplets and System-in-Package (SiP):
- Multi-Chip-Module: interconnect density ~100 µm x 100 µm
- 2.5D partitioning: interconnect density ~10 µm x 10 µm
- 3D die stacking: interconnect density ~10 µm x 10 µm
(Source: DARPA)

Examples:
- High-Bandwidth Memory die stacks (sources: Micron, LETI)
- System-in-Package and 3D Integrated Circuits (3D IC) (source: GeorgiaTech)
- AMD chiplet evolution (source: AMD): (2017.10) EPYC 7260, 4-chiplet chip; (2018.11) EPYC "Rome", 9-chiplet chip, AMD Zen2 architecture
How to bring Europe back into the processor race? For today's leading HPC chips, both design and manufacturing sit outside Europe:
- IBM POWER9
- NVIDIA Volta GV100
- Sunway SW26010
- Intel Xeon E5
European Processor Initiative (EPI)
EUROPEAN PROCESSOR INITIATIVE - objectives:
* Pre-exascale level with a general-purpose CPU core in the first EPI GPP chip
* Adopt an Arm general-purpose CPU core with SVE vector acceleration in the first EPI chip
* Develop acceleration technologies for better DP GFLOPS/Watt performance
* Include an MPPA for real-time application acceleration
* Develop a Common Platform to enable the EPI accelerations
* Supply sufficient memory bandwidth (Byte/FLOP) to support GPP applications
* In SGA1, focus on programming models to include the accelerations

Streams:
S1 - Common: codesign, architecture, system software and key technologies for the Common Platform
S2 - GPP Processor: design and implement the processor chip(s) and PoC system
S3 - Acceleration: foster acceleration technologies and create building blocks
S4 - Automotive: address automotive market needs and create a pilot eHPC system
S5 - Administration: manage and support activities

[GPP chip floorplan: ARM cores, an MPPA, an eFPGA and EPAC accelerators on the die, with HBM memories, DDR memories, PCIe gen5 links, HSL links, and D2D links to adjacent chiplets.]
[EPAC accelerator detail: VPU, STX and VRP units, each connected through a bridge to the GPP.]
[EPI GPP package: interposer/package integration of the silicon die together with a power management & security controller. On the die: Armv8 cores with attached accelerators (EPAC, eFPGA blocks, MPPA), PCIe ports or high-speed links, integrated HBM2E stacks, DDR5/4 channels, and die-to-die links, all connected by a memory-coherent network-on-chip in a 2D-mesh topology with a distributed system-level cache.]
Architecture notes and glossary:
- NoC: network-on-chip; HSL: high-speed links (with memory-coherence support)
- The power-management infrastructure and the interrupt network span the chip
- Acceleration blocks attach to the NoC through AXI slave and master ports, next to the Armv8 CPU cores with SVE
- Datasets can be shared by acceleration blocks through the NoC's SLC or in external memory
- CU: computing unit; either an Armv8 core with SVE or an EPAC/MPPA acceleration block
- SLC: system-level cache; a last-level cache in front of the external memories (HBM or DDR)
Power management:
- Out-of-band: zero overhead
- In-band: low-latency PM requests
- Power cap, a.k.a. max performance at a given power P

[Node power-management stack: the operating system issues in-band requests via governors towards the GPP and its VRMs (DIMM, RAS, node power cap); out-of-band, the system/resource management path reaches the chip's power controller through a BMC (RJ45).]

Vendor comparison (source: PowerStack19; domains: Socket (S), Core (C), Memory (M), Accelerator (G), Node (N), Utilization (U), Temperature (T)):

Monitor (domain; granularity):
- Intel: S, M, A, T; 1 ms
- IBM: N, S, M, A, T, U; 500 µs, 10 ms aggregation
- ARM: S, M, T; 1-10 kHz with SCP (100 ms)
- AMD: N, S, M, A, T; 1 s (C), ~ns model-based (C)
- Cray: N, S, M, A; OOB 1 ms (N), 16 ms for T & U, 100 ms aggregation
- Fujitsu: N, S, C, M; 1 ms (G)

Control (domain; granularity):
- Intel: S, M; RAPL 1 ms (in-band), DVFS 500 µs
- IBM: N, S, M, A; 10-100 ms
- ARM: S, M; 1-10 kHz (100 ms to 1 s)
- AMD: N, S, M, A; ~seconds
- Cray: N, S, M, A; DVFS, RAPL, min-max range, 10-30 s at job launch
- Fujitsu: S, C, M; DVFS, decode width, HBM2 bandwidth

Interfaces / tools:
- Intel: RAPL MSRs, msr-safe, libmsr, PAPI, memory map
- IBM: OpenBMC, amester, PAPI, memory map
- ARM: ACPI, SCP (system control processor), BMC interfaces, IPA (intelligent power allocator)
- AMD: Likwid, PAPI
- Cray: CapMC, Cray PAPI
- Fujitsu: Power API, PAPI

EPI power management design is powered by UNIBO and targets:
- Support for fine-grain power monitoring and control
- A higher-performance power controller capable of supporting advanced power-control algorithms

Security infrastructure: a Root of Trust (RoT), security services and advanced cryptos, with the CUs partitioned into security domains (#1, #2, ...).

Co-design process and modeling: architects and application experts collect requirements from streams S2 (GPP Processor), S3 (Acceleration) and S4 (Automotive); benchmarks (full applications, mini-apps, tests) run on simulators, emulators and evaluation platforms; models produce evaluation results against selection criteria, showing the impact of design parameters on application performance, which feeds back, together with technical constraints, into the processor design loop.

CREDITS:
- P. Petrakis, V. Papaefstathiou et al.
(FORTH): simulation execution and analysis
- B. Brank, S. Nassyr (FZJ): BLIS micro-kernel
- A. Portero (FZJ): Gem5 simulator setup

[Tracing flow: the application binary is instrumented via an OpenMP runtime-call plugin and MPI instrumentation; task/chunk creation events, dependencies, dynamic instruction counts and MPI calls are collected into a trace.]